5 Useful Tips While Working With Python Scrapy
Today, there are a handful of Python Scrapy tutorials out there. However, I find that most of them skip the low-hanging fruit that can greatly improve the developer experience of web scraping.
In this post, I will be sharing with you some quick and easy tips that you can use when working with Scrapy.
These tips will greatly benefit you as a developer while some tricks here can even help to minimize the load to the websites that you want to scrape from.
TL;DR
- Use HTTPCache during development
- Always use AutoThrottle
- Consume sites’ API whenever available
- Use bulk insert for database write operation in item pipelines
- Wrap your target URL with a proxy (e.g. ScraperAPI)
- Bonus: Colorized logging
Without further ado, let’s get started!
Use HTTPCache
While developing spiders (crawlers), we often check if our spiders are working or not by hitting the web server multiple times for each test.
As our development work is an iterative process, we would indirectly increase the server load for the site that we are scraping.
To avoid such behavior, Scrapy provides a built-in middleware called HttpCacheMiddleware which caches every request made by our spiders along with the related response.
Example
To enable this, simply add the code below to your Scrapy project's settings.py:
# Enable and configure HTTP caching (disabled by default)
HTTPCACHE_ENABLED = True
Ultimately, this is a win-win scenario: our tests run much faster, and we no longer bombard the site with requests while testing.
NOTE: Do remember to configure HTTPCACHE_EXPIRATION_SECS in production.
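For reference, here is a sketch of the related cache settings you might want to tune; the values shown are illustrative, not recommendations:

# settings.py (illustrative values)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # expire cached responses after 1 hour (0 means never expire)
HTTPCACHE_DIR = "httpcache"  # cache folder inside the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # don't cache server errors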
AutoThrottle
“The first rule of web scraping — do no harm to the website.”
Typically, to reduce the load on the websites that we are scraping, we would configure DOWNLOAD_DELAY in our settings.py.
However, as the capability to handle requests can vary across different websites, we can automatically adjust the crawling speed to an optimum level using Scrapy’s extension — AutoThrottle.
AutoThrottle adjusts the delay between requests according to the web server's load, using the download latency of each response as a signal. The delay is adjusted so that, on average, your spider sends no more than AUTOTHROTTLE_TARGET_CONCURRENCY concurrent requests to the site.
Example
To enable AutoThrottle, just include this in your project's settings.py:
# Check out the available settings that this extension provides!
# AUTOTHROTTLE_ENABLED (disabled by default)
AUTOTHROTTLE_ENABLED = True
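If you need finer control, AutoThrottle exposes a few more settings; the values below are illustrative, not recommendations:

# settings.py (illustrative values)
AUTOTHROTTLE_START_DELAY = 5.0  # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0  # maximum delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server
AUTOTHROTTLE_DEBUG = True  # log throttling stats for every received response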
Not only that, but you can also reduce the chances of getting blocked by the website! As simple as that.
Use the Site’s API
One of the key challenges to look out for while scraping the web is dynamically loaded data. When I was starting out, I would only consider using a site's HTTP API when I ran into this particular issue, which is not ideal.
Today, many websites have HTTP APIs available for third parties to consume their data without having to scrape the web pages. It is one of the best practices to always use the site’s API whenever available.
Moreover, scraping data directly via the site's API has a lot of advantages, as the data returned is more structured and less likely to change. Another bonus is that we avoid having to deal with that pesky HTML.
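As a rough sketch, consuming a JSON API from a spider could look like the snippet below; the endpoint and field names are made up for illustration, and response.json() requires Scrapy 2.2 or newer:

import scrapy

class ProductsApiSpider(scrapy.Spider):
    name = "products_api"
    # Hypothetical JSON endpoint, used purely for illustration
    start_urls = ["https://ecommerce.example.com/api/v1/products?page=1"]

    def parse(self, response):
        data = response.json()  # parse the JSON body (Scrapy >= 2.2)
        for product in data.get("results", []):
            yield {
                "name": product.get("name"),
                "price": product.get("price"),
            }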
Bulk (batch) Insert to Database
As you might already know, we can store the scraped items in our database using Scrapy's item pipelines.
For starters, we can easily write a single row of data for each item that we scraped from a website.
As we start to scale up by scraping multiple sites with thousands or more items concurrently, we will soon run into issues with our database writes.
To cope with that, we can use bulk inserts (bulk saving).
Example
Here is an example of how you can use bulk insert with SQLAlchemy in your Scrapy pipeline.
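The snippet below is a minimal sketch of such a pipeline; the engine URL, the Product and ProductPrice models, and the field names are assumptions made for illustration:

# pipelines.py (sketch: models, fields, and engine URL are assumptions)
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from example.models import Product, ProductPrice  # hypothetical mapped classes

class BulkInsertPipeline:
    def open_spider(self, spider):
        engine = create_engine("postgresql://user:password@localhost/example")
        self.Session = sessionmaker(bind=engine)
        self.items = []

    def process_item(self, item, spider):
        # Collect plain dicts; bulk_insert_mappings works on dictionaries
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        session = self.Session()
        try:
            products = [{"name": it["name"]} for it in self.items]
            # return_defaults=True writes the generated primary key back into
            # each dict so it can be used as a foreign key below
            session.bulk_insert_mappings(Product, products, return_defaults=True)

            prices = [
                {"product_id": p["id"], "price": it["price"]}
                for p, it in zip(products, self.items)
            ]
            session.bulk_insert_mappings(ProductPrice, prices)
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()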
I used bulk_insert_mappings instead of bulk_save_objects here as it accepts lists of plain Python dictionaries, which results in less overhead associated with instantiating mapped objects and assigning state to them, hence it's faster. Check out the comparison here.
The code example above showcases the scenario where you are dealing with Foreign Keys (FK). If the rows to be inserted only refer to a single table, then there is no reason to set return_defaults to True (source).
In short, bulk insert is a lot more efficient and faster than row-by-row operations. You can use bulk insert to insert millions of rows in a very short time.
Caveat
One of the challenges of using bulk inserts is that if you try to add too many records at once, the database may lock the table for the duration of the operation. So depending on your application's needs, you might want to reduce the size of your batch (e.g. commit once every 10,000 items).
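One way to do this in the pipeline sketch above is to flush inside process_item once a threshold is reached; COMMIT_BATCH_SIZE and the _flush() helper below are hypothetical names used only for illustration:

# Sketch: flush batches as we go instead of waiting for close_spider
COMMIT_BATCH_SIZE = 10_000  # tune this to your database and workload

def process_item(self, item, spider):
    self.items.append(dict(item))
    if len(self.items) >= COMMIT_BATCH_SIZE:
        self._flush()  # hypothetical helper wrapping the bulk insert + commit
        self.items = []
    return item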
Using a Proxy API
While scraping large e-commerce sites such as Amazon, you'll often find yourself in need of proxy services. Instead of building our own proxy infrastructure, the easiest way to do so is to use a proxy API such as Scraper API.
Let’s create a utility function that takes in a URL and turns it into a Scraper API URL:
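A minimal sketch of such a helper, assuming the API key lives in a module-level constant (Scraper API accepts the key and target URL as query parameters on its endpoint):

# example/utils.py (sketch: keep the real key in an env var or Scrapy setting)
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPER_API_KEY"  # placeholder

def get_proxy_url(url):
    # Route the request through Scraper API by wrapping the target URL
    payload = {"api_key": API_KEY, "url": url}
    return "http://api.scraperapi.com/?" + urlencode(payload)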
To use it, simply wrap the URL that you want to scrape with our newly created get_proxy_url function.
# Example in your spider.py
import scrapy

from example.utils import get_proxy_url

def start_requests(self):
    url = "https://ecommerce.example.com/products"
    yield scrapy.Request(url=get_proxy_url(url), callback=self.parse)
In my opinion, using Scraper API was a breeze and their free tier allows you to scrape 1,000 web pages per month. Check out their pricing here.
Bonus: Colorized Logging
“How can I have logs that are conveniently conspicuous based on color?”
While developing my Scrapy projects, I often found myself asking that same question over and over again. Enter the colorlog package, which does exactly that!
After installing the package into your virtual environment, simply add the following code to your settings.py to enable colorized logging:
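The snippet below is a sketch of one common approach: it builds a colorlog ColoredFormatter and swaps it onto the handler Scrapy creates by patching the private scrapy.utils.log._get_handler helper, which may change between Scrapy versions:

# settings.py (sketch: relies on a private Scrapy helper, so treat with care)
import copy

import scrapy.utils.log
from colorlog import ColoredFormatter

color_formatter = ColoredFormatter(
    "%(log_color)s[%(asctime)s] %(levelname)-8s%(reset)s "
    "%(white)s%(name)s:%(reset)s %(log_color)s%(message)s%(reset)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    log_colors={
        "DEBUG": "cyan",
        "INFO": "green",
        "WARNING": "yellow",
        "ERROR": "red",
        "CRITICAL": "red,bg_white",
    },
)

# Keep a reference to Scrapy's original handler factory, then wrap it
_get_handler = copy.copy(scrapy.utils.log._get_handler)

def _get_handler_custom(*args, **kwargs):
    handler = _get_handler(*args, **kwargs)
    handler.setFormatter(color_formatter)
    return handler

scrapy.utils.log._get_handler = _get_handler_custom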
Wrap Up
In short, you learned how to easily avoid bombarding the websites that you want to scrape by using HTTPCACHE_ENABLED and AUTOTHROTTLE_ENABLED. We must do our part to ensure that our spiders are as non-invasive as possible.
On top of that, we can greatly reduce the number of write operations to our database by incorporating bulk inserts in our item pipelines.
That is all I have today and I hope you learned a thing or two from this post!
Cheers, and happy web scraping!