Rodolfo De Nadai

Web Scraping with Scrapy

Posted 632 day(s) ago
   

How to extract data from websites

Over the past few months I’ve been using the Scrapy framework a lot, and this article is about exactly that: using Scrapy to extract relevant data.
Extracting data means taking unstructured information from one (or more than one) website, parsing it and using it for our own purposes.
Wikipedia has a good explanation of what I just said:

“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox.
Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.” - from Wikipedia

Well, there are a lot of tools we can use to extract data from websites, but I find Scrapy a very good and easy one.

What’s Scrapy?

It’s a framework. A framework that’ll help you extract data. It’s written in Python, which means it’s way beyond great!! hahahahaaha :D
There are a lot of companies already using it, which makes the framework even better tested. One client uses Scrapy for data mining, which is one of the things you could do with it, or perhaps you just want to scrape the photos from a site!

How to use...

To use Scrapy there is some basic setup to do. To keep this article simple, I’m using GNU/Linux, in particular Linux Mint.
Since Linux is great, Python comes pre-installed on the Mint distro, so there is no need to run any kind of installation procedure for it.
Scrapy is a third-party framework, so we need to install it. I recommend using pip to install Python packages. If you don’t know pip, take a few minutes to understand how it works and the wonders it can do for you.
To install pip on the system (if you haven’t installed it yet), use Synaptic to search for its package (you can install it manually if you want, RTFM for that). After the installation, type this in the shell to install Scrapy:

$ pip install scrapy

This way, the framework will be installed and ready to use. BeautifulSoup4 is another great tool for handling HTML: it can parse documents and access their elements easily, which makes it a good tool for post-processing the items that Scrapy collects and records in a database.
To install it, type:

$ pip install beautifulsoup4

For a quick example, I’ll scrape the news from the website of PMC (Prefeitura Municipal de Campinas, the Campinas city government).
Let’s create a default Scrapy project to do the job; you can find info about this in the docs.
To start a new Scrapy project from the command line, run:

$ scrapy startproject noticias_pmc

A structure will be created in the folder where you execute the command (in this case, the folder is created in the home directory of the logged-in user).

noticias_pmc/
    scrapy.cfg
    noticias_pmc/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
             __init__.py

scrapy.cfg: the project configuration file
noticias_pmc/: the project’s python module, you’ll later import your code from here.
noticias_pmc/items.py: the project’s items file.
noticias_pmc/pipelines.py: the project’s pipelines file.
noticias_pmc/settings.py: the project’s settings file.
noticias_pmc/spiders/: a directory where you’ll later put your spiders.

The first step is to define an item structure (the information that we want to extract and put to use).
Open the file noticias_pmc/items.py:

from scrapy.item import Item, Field

class NewsPmcItem(Item):
    title = Field()
    data = Field()
    text = Field()
    image_urls = Field()
    images = Field()
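As far as I know, a Scrapy Item behaves like a dictionary that only accepts the fields declared in its class, so writing to a key that was never declared raises a KeyError. The sketch below is a plain-Python stand-in that imitates that behaviour so it runs without Scrapy installed; DeclaredFieldsItem and NewsItemSketch are hypothetical names and this is not Scrapy's actual implementation.

```python
class DeclaredFieldsItem(dict):
    """Hypothetical stand-in: a dict that only accepts keys listed in `fields`."""

    fields = ()

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class NewsItemSketch(DeclaredFieldsItem):
    # Same field names as NewsPmcItem above
    fields = ("title", "data", "text", "image_urls", "images")


item = NewsItemSketch()
item["title"] = "Some headline"  # declared field, accepted

try:
    item["titulo"] = "Some headline"  # undeclared field, rejected
    rejected = False
except KeyError:
    rejected = True
```

The practical consequence is that the keys used in the spider have to match the field names declared in items.py exactly.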

Done! Now let’s build our spider!
To do that, inside the folder noticias_pmc/spiders/ create a file named NewsPMCSpider.py. Notice that Scrapy uses XPath syntax to locate elements inside the parsed HTML. BeautifulSoup4 can use other means to access those elements.

# -*- coding: utf-8 -*-

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from noticias_pmc.items import NewsPmcItem
import urlparse

# CrawlSpider is the superclass (this is how inheritance is declared in Python)
class NewsPMCSpider(CrawlSpider):
    # Name of our spider
    name = 'noticias_pmc'
    # Allowed domains; we do not want the spider to read the entire web, do we??
    allowed_domains = ['campinas.sp.gov.br']
    # Which URL the crawl should start from
    start_urls = ['http://campinas.sp.gov.br/noticias.php']
    # Rules: which URL patterns to read, the callback that parses each response,
    # and follow=True to tell our spider to keep going to other URLs!
    rules = (
        # Extract links and parse them with the spider method parse_item
        Rule(SgmlLinkExtractor(allow=['http://campinas.sp.gov.br/noticias.php', 'http://campinas.sp.gov.br/noticias-integra.php']), callback='parse_item', follow=True),
    )

    # This does all the work
    def parse_item(self, response):
        # Parse the server response so we can access its elements
        hxs = HtmlXPathSelector(response)
        # XPath to find the title; we want only the text here (no HTML tags!)
        titulo = hxs.select('//div[@class="itens"]/h3').select('string()').extract()
        # Without a title this is not a valid news page, so there is nothing to extract
        if not titulo:
            return None
        # Create a news item!
        item = NewsPmcItem()
        # Get the news date
        data = hxs.select('//div[@class="itens"]/p[@class="data"]').select('string()').extract()
        # The body text
        texto = hxs.select('//div[@class="itens"]/p[@align="justify"]').select('string()').extract()
        # Clean up; the keys must match the fields declared in items.py
        item['title'] = titulo[0].strip()
        item['data'] = data[0].strip() if data else ''
        item['text'] = "".join(texto).strip()
        # Collect the image URLs; Scrapy saves the images automatically in the folder defined in settings.py
        item['image_urls'] = self.parse_imagens(response.url, hxs.select('//div[@id="sideRight"]/p/a/img'))
        return item

    def parse_imagens(self, url, imagens):
        image_urls = []
        for imagem in imagens:
            # Image path; skip <img> tags that have no src attribute
            src = imagem.select('@src').extract()
            if not src:
                continue
            # urljoin resolves a relative path against the page URL
            # and leaves an absolute URL unchanged
            image_urls.append(urlparse.urljoin(url, src[0].strip()))
        return image_urls
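To see what the XPath lookups and the URL joining in the spider are doing, here is a small sketch that runs without Scrapy. It uses only the Python 3 standard library: xml.etree.ElementTree (which supports a limited XPath subset and needs well-formed markup, unlike real-world HTML) and urllib.parse.urljoin (the Python 3 home of the urljoin function imported from urlparse above). SAMPLE_PAGE and PAGE_URL are made-up values for illustration; the real page layout may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample markup (not the real page)
SAMPLE_PAGE = """
<html><body>
  <div class="itens">
    <h3> Sample headline </h3>
    <p class="data">01/01/2014</p>
  </div>
  <div id="sideRight">
    <p><a href="#"><img src="imagens/foto1.jpg"/></a></p>
    <p><a href="#"><img src="http://example.com/foto2.jpg"/></a></p>
  </div>
</body></html>
""".strip()

# Illustrative URL of the page the markup was "downloaded" from
PAGE_URL = "http://campinas.sp.gov.br/noticias-integra.php"

root = ET.fromstring(SAMPLE_PAGE)

# Same idea as //div[@class="itens"]/h3 followed by string(): text only, no tags
h3 = root.find(".//div[@class='itens']/h3")
title = "".join(h3.itertext()).strip()

# Same idea as //div[@id="sideRight"]/p/a/img: collect every image source
image_urls = []
for img in root.findall(".//div[@id='sideRight']/p/a/img"):
    src = img.get("src")
    if src:
        # Relative paths are resolved against the page URL,
        # absolute URLs come back unchanged
        image_urls.append(urljoin(PAGE_URL, src.strip()))

print(title)
print(image_urls)
```

Because urljoin already copes with both relative and absolute paths, there is no need to check for 'http' in the path before joining.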

Before running our spider, we must change two other files, settings.py and pipelines.py.
Add the following lines to settings.py (anywhere in the file):

# Name of the class in the pipelines file that will process the images
ITEM_PIPELINES = ['noticias_pmc.pipelines.MyImagesPipeline', ]
# The directory where the images will be stored
IMAGES_STORE = '<internal path>/noticias_pmc/images'

And in pipelines.py paste the MyImagesPipeline class:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request every image URL collected by the spider;
        # items without images simply yield nothing
        for image_url in item.get('image_urls') or []:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results holds (success, info) pairs; keep the URL and the
        # stored path of every image that was downloaded successfully
        item['image_urls'] = [{'url': x['url'], 'path': x['path']} for ok, x in results if ok]
        return item
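The list comprehension in item_completed is easier to follow with some data in hand. In the sketch below, results is a hand-made list shaped like what the images pipeline passes in, as I understand it: pairs of (success, info), where info is a dict with keys such as url and path when the download worked. The URLs, paths and the plain Exception used for the failed entry are made up for illustration (Scrapy passes its own failure object there).

```python
# Hand-made sample of the (success, info) pairs received by item_completed
results = [
    (True, {"url": "http://example.com/a.jpg", "path": "full/a.jpg", "checksum": "abc"}),
    (False, Exception("download failed")),  # stand-in for a failed download
    (True, {"url": "http://example.com/b.jpg", "path": "full/b.jpg", "checksum": "def"}),
]

# Keep only the successful downloads, and only the two keys we care about
downloaded = [{"url": x["url"], "path": x["path"]} for ok, x in results if ok]

print(downloaded)
```

The failed entry is dropped by the `if ok` filter, so the item ends up listing only images that are actually on disk.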

Done again! Let’s now run our spider and see the results.
Inside the Scrapy project folder, type:

$ scrapy crawl noticias_pmc

Look!!! a spider on the web.... hahahahahahahhahahaha

If you have any problems using Scrapy, leave a message; if I can help, I will!!!
There’s much more in the Scrapy documentation; take a minute (or more than one) and read it!!!

