Scraping Websites With Python Scrapy Spiders

Web scraping is the automated process of extracting large amounts of data from websites. It enables the quick gathering of information for research, monitoring, data analytics, and more.

This guide covers the basics of using Scrapy, a popular Python web scraping framework, to build a robust scraper for harvesting data.

Why Web Scraping is Useful

Some practical use cases include:

Price Monitoring - Track prices across e-commerce stores for market intelligence or pricing optimization. Automated updates are far more efficient than manual checking.

Lead Generation - Harvest business listings data, including contact information for sales and marketing teams. Easier than buying marketing lists.

Content Monitoring - Check sites constantly for relevant content changes like company announcements or news about competitors. Great for reputation management and PR teams who must stay on top of industry news.

Building Machine Learning Datasets - Web pages can provide volumes of high-quality, niche text and data. With some cleansing, web data is great training material for NLP and ML models.

Web Analytics/SEO Monitoring - Understand visitor behavior through clickstream data and search rankings. Measuring manually would be very error-prone.

Why Scrapy?

There are many Python frameworks available, but Scrapy has some advantages:

  • Battle-tested with over a decade of development and large production deployments

  • Integrates well with all major Python data science and web frameworks

  • Fast and built to scale via its asynchronous architecture

  • A broad ecosystem of plugins and integrations available

With responsible implementation, Scrapy can speed up data harvesting significantly while requiring minimal infrastructure resources.

Building a Product Scraper with Scrapy

First, install Scrapy:

pip install scrapy

Then, generate a new Scrapy project called myscraper:

scrapy startproject myscraper

This generates the project's boilerplate code and folder structure. The key files and folders are:

  • myscraper/spiders - Where our spiders will live

  • myscraper/items.py - For defining scraped data schemas

  • myscraper/settings.py - Settings for our scraper
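
For reference, the generated project typically looks like this (exact contents can vary slightly between Scrapy versions):

    myscraper/
        scrapy.cfg            # deploy configuration file
        myscraper/
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider/downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/
                __init__.py   # spiders are added here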

Define Items To Scrape

Open myscraper/items.py. We'll define our product item schema:

    import scrapy

    class ProductItem(scrapy.Item):
        title = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        description = scrapy.Field()

The ProductItem class represents the schema of data we aim to scrape.
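
Scrapy items behave much like Python dictionaries, which makes them easy to populate and inspect. A quick illustration (the values are made up):

    from myscraper.items import ProductItem

    item = ProductItem(title='Example Widget')
    item['price'] = '19.99'
    print(item['title'])   # -> Example Widget
    print(dict(item))      # items convert cleanly to plain dicts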

Write Spider Scraping Logic

Navigate to myscraper/spiders and create a new file called products.py with our spider code:

    import scrapy
    from ..items import ProductItem

    class ProductSpider(scrapy.Spider):

        name = 'products'

        start_urls = [
            'https://example.com/products'
        ]

        def parse(self, response):
            # Follow links to product pages
            for href in response.css('.product-listing a::attr(href)'): 
                full_url = response.urljoin(href.extract())
                yield scrapy.Request(full_url, callback=self.parse_product)


        def parse_product(self, response):
            item = ProductItem()

            # Extract product data
            item['title'] = response.css('.product-title::text').get() 
            item['price'] = response.css('.price::text').get()
            item['stock'] = response.css('.availability::text').get()        
            item['description'] = response.css('.product-desc::text').get()

            yield item

This spider will:

  • Start scraping from example.com/products

  • Follow links to individual product URLs

  • Extract data into our defined item schema

  • Yield the scraped items
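
Many product listings are paginated. If the target site exposes a "next page" link, the parse method can follow it as well. A minimal sketch, assuming a hypothetical .next-page selector for the pagination link:

    def parse(self, response):
        # Follow links to product pages
        for href in response.css('.product-listing a::attr(href)'):
            yield response.follow(href, callback=self.parse_product)

        # Hypothetical selector - adjust to the site's actual markup
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Note that response.follow resolves relative URLs automatically, so the explicit urljoin call becomes unnecessary.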

Store Scraped Data

By default, Scrapy supports exporting scraped items to JSON, JSON Lines, CSV, and XML. Just pass an output file to the crawl command:

scrapy crawl products -o products.json
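
The export format is inferred from the file extension, so CSV or JSON Lines output works the same way:

scrapy crawl products -o products.csv
scrapy crawl products -o products.jl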

You can also create item pipelines to store data in databases or object storage such as S3 instead.
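
As a minimal sketch of an item pipeline, the class below (which would live in myscraper/pipelines.py) normalizes the scraped price string into a float; the exact cleaning rules are just an assumption about the site's formatting:

    class PriceCleanupPipeline:
        """Strip currency symbols and whitespace so prices are stored as floats."""

        def process_item(self, item, spider):
            raw_price = item.get('price')
            if raw_price:
                # Assumes prices like "$1,299.00" - adapt to the real format
                item['price'] = float(raw_price.replace('$', '').replace(',', '').strip())
            return item

Enable it by adding ITEM_PIPELINES = {'myscraper.pipelines.PriceCleanupPipeline': 300} to settings.py.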

Configure Settings

Open myscraper/settings.py to tweak the defaults; a sample configuration is sketched after this list. Important settings include:

  • AUTOTHROTTLE settings (e.g. AUTOTHROTTLE_ENABLED) to avoid overwhelming sites

  • CONCURRENT_REQUESTS to control how many requests run in parallel

  • DOWNLOAD_DELAY to add a delay between consecutive requests to the same site
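
A conservative, polite starting point might look like this (the exact values are illustrative, not official recommendations):

    # myscraper/settings.py (excerpt)
    ROBOTSTXT_OBEY = True          # respect robots.txt rules
    CONCURRENT_REQUESTS = 8        # cap the number of parallel requests
    DOWNLOAD_DELAY = 1.0           # seconds to wait between requests to the same site
    AUTOTHROTTLE_ENABLED = True    # adapt the delay to observed server latency
    AUTOTHROTTLE_START_DELAY = 1.0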

Run Spider

Navigate to the myscraper project folder and run:

scrapy crawl products
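
You can also launch the crawl from a Python script instead of the command line using Scrapy's CrawlerProcess (a minimal sketch, assuming it runs from the project root):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl('products')   # the spider name defined in ProductSpider
    process.start()             # blocks until crawling is finished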

This guide covered the fundamentals - there are many additional features to explore! Please scrape ethically.

Conclusion

As shown in the guide above, with just a few dozen lines of code, we can set up scalable spiders in Scrapy that crawl across websites, harvesting relevant data quickly and efficiently.

Scrapy requires learning Python and understanding CSS selectors but pays off with flexible, maintainable, and rapid web harvesting capabilities. Robust plugin support also makes it highly extensible for advanced use cases.

When used responsibly, with careful attention to third-party terms of service and appropriate rate limiting, Scrapy is an indispensable tool for gathering web data. The scenarios are endless - price monitoring, web analytics, lead generation, machine learning data sourcing, and more.

Explore the Scrapy documentation for additional example spiders and tutorials covering the breadth of functionality available in this mature web scraping framework.

Please note that everything here is for educational purposes and should not be used without proper permission from the host site you are scraping.

If you like my work and want to help me continue dropping content like this, buy me a cup of coffee.

If you find this post exciting, find more exciting posts on Learnhub Blog; we write about everything tech, from Cloud computing to Frontend Dev, Cybersecurity, AI, and Blockchain.
