Building a scalable web crawler is no small task. Crawling at scale presents several challenges, including traversing many pages, avoiding duplicate links, and optimizing performance.
But no worries. We'll guide you through building your own efficient Python web crawler that scales, covering:
- Building your first Python web crawler.
- Optimizing your Python web crawler.
- Avoiding blocks during web crawling.
- Web crawling tools for Python.
- Web crawling best practices for Python.
What Is a Web Crawler in Python?
A web crawler navigates web pages, discovers links, and follows them, usually to extract specific information. You can think of it as the engine behind a search engine, scouring the web to find pages and the links between them.
Web crawling and web scraping share some similarities, but their operations differ slightly: a web crawler discovers and indexes many web pages, while a web scraper extracts and stores specific data from those pages.
If you want to create a production-ready web crawler, you'll follow a few steps that let you add new URLs to your crawl queue continuously. That said, you also want your web crawler to avoid common problems like link duplication, endless crawling, latency, IP bans, etc.
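To make that concrete, here's a minimal sketch of the core crawl loop you'll flesh out in this tutorial; the page limit is arbitrary, and the libraries it uses (Requests and Beautiful Soup) are installed in the prerequisites below:
# a minimal crawl-loop sketch (Requests and Beautiful Soup are installed below)
import requests
from bs4 import BeautifulSoup

to_visit = ["https://www.scrapingcourse.com/ecommerce/"]  # seed URL
visited = set()

while to_visit and len(visited) < 5:  # small page limit for this example
    url = to_visit.pop(0)
    if url in visited:
        continue  # skip already-crawled pages to avoid duplication
    visited.add(url)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for link in soup.select("a[href]"):
        # resolve relative links and queue them for a later visit
        to_visit.append(requests.compat.urljoin(url, link["href"]))

print(visited)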
To learn more, check out our in-depth guide on Web Scraping vs. Web Crawling.
Now that you know what a web crawler is, it's time to build one.
Build Your First Python Web Crawler
We'll now crawl this E-commerce Challenge website. The site has many pages, including paginated product listings, shopping carts, and category pages. Your web crawler will follow some of these links and extract specific product details: names, prices, URLs, and image links.
See what the page looks like below:

Before jumping into the tutorial, let's start with the requirements.
Prerequisites for Python Web Crawling
You'll need the following tools to get started:
- Python 3+: Download the latest version, install it, and add it to your system's path.
- A Python IDE: Visual Studio Code with the Python extension or PyCharm Community Edition will do.
You'll also need an HTTP client and an HTML parser for web crawling. The two most popular Python packages for that are:
- Requests: A powerful HTTP client library that facilitates the execution of HTTP requests and handles their responses.
- Beautiful Soup: A full-featured HTML and XML parser that exposes a complete API to explore the DOM, select HTML elements, and retrieve data from them.
Install them both using the following command:
pip3 install beautifulsoup4 requests
Great! You're ready to begin.
The following sections provide a step-by-step guide to building your first web crawler in Python.
Step 1: Follow All the Links on a Website
In this step, you'll design your crawling logic to discover and follow all the links on the target website. Before that, you'll make a simple request to the target website.
Create a crawler function with a request to the target page. Behind the scenes, the get method performs an HTTP GET request to the specified URL and returns its full-page HTML:
# pip3 install requests
import requests
# request the target URL
def crawler():
response = requests.get("https://www.scrapingcourse.com/ecommerce/")
response.raise_for_status()
print(response.text)
# execute the crawler
crawler()
The above code sends a simple GET request to the target website and prints its HTML response:
<!DOCTYPE html>
<html lang="en-US">
<head>
<!-- ... -->
<title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body class="home archive ...">
<p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
<ul class="products columns-4">
<!-- ... -->
</ul>
</body>
</html>
Let's now turn this basic script into a full-blown web crawler.
Add Beautiful Soup to your imports and set the target URL as the initial link in the urls_to_visit list.
# pip3 install requests beautifulsoup4
# ...
from bs4 import BeautifulSoup
target_url = "https://www.scrapingcourse.com/ecommerce/"
# initialize the list of discovered URLs
urls_to_visit = [target_url]
Modify the crawler function to run continuously with a while loop as long as urls_to_visit isn't empty. Get the next link from the front of the URL list and parse the website's HTML with Beautiful Soup. Extract all href attributes from a tags and convert them to absolute URLs. Then, update the urls_to_visit list with newly discovered links.
Here's the modified code:
# ...
def crawler():
while urls_to_visit:
# get the page to visit from the list
current_url = urls_to_visit.pop(0)
# ...
# parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
# collect all the links
link_elements = soup.select("a[href]")
for link_element in link_elements:
url = link_element["href"]
# convert links to absolute URLs
if not url.startswith("http"):
absolute_url = requests.compat.urljoin(target_url, url)
else:
absolute_url = url
# ensure the crawled link belongs to the target domain and hasn't been visited
if (
absolute_url.startswith(target_url)
and absolute_url not in urls_to_visit
):
urls_to_visit.append(absolute_url)
The code might run an endless crawl at this point because it doesn't set a crawl limit. Adding one ensures the crawler stops after a predefined number of crawls, preventing excessive or infinite crawling.
To apply the limit, set the maximum number of crawls to 20. Then, use a crawl counter to track how many pages have been crawled. Update the while loop logic with these changes as shown:
# set a maximum crawl limit
max_crawl = 20
def crawler():
# set a crawl counter to track the crawl depth
crawl_count = 0
while urls_to_visit and crawl_count < max_crawl:
# ...
# ... crawling logic
# update the crawl counter
crawl_count += 1
Finally, print the urls_to_visit list and execute the crawler function:
# ...
def crawler():
# ...
# print the crawled URLs
print(urls_to_visit)
# execute the crawl
crawler()
Combine the snippets at this point, and you'll get the following complete and updated code:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
target_url = "https://www.scrapingcourse.com/ecommerce/"
# initialize the list of discovered URLs
urls_to_visit = [target_url]
# set a maximum crawl limit
max_crawl = 20
def crawler():
# set a crawl counter to track the crawl depth
crawl_count = 0
while urls_to_visit and crawl_count < max_crawl:
# get the page to visit from the list
current_url = urls_to_visit.pop(0)
# request the target URL
response = requests.get(current_url)
response.raise_for_status()
# parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
# collect all the links
link_elements = soup.select("a[href]")
for link_element in link_elements:
url = link_element["href"]
# convert links to absolute URLs
if not url.startswith("http"):
absolute_url = requests.compat.urljoin(target_url, url)
else:
absolute_url = url
# ensure the crawled link belongs to the target domain and hasn't been visited
if (
absolute_url.startswith(target_url)
and absolute_url not in urls_to_visit
):
urls_to_visit.append(absolute_url)
# update the crawl count
crawl_count += 1
# print the crawled URLs
print(urls_to_visit)
# execute the crawl
crawler()
The above code outputs the URLs discovered during the first 20 crawls:
[
'https://www.scrapingcourse.com/ecommerce/?add-to-cart=2740',
'https://www.scrapingcourse.com/ecommerce/product/ajax-full-zip-sweatshirt/',
# ... omitted for brevity,
'https://www.scrapingcourse.com/ecommerce/page/1/',
'https://www.scrapingcourse.com/ecommerce/page/5/',
# ... omitted for brevity,
]
Great! It's time to filter the above links and extract data from specific product pages.
Step 2: Extract Data From the Crawler
Once you have the links, the next step is to extract data from selected pages (specifically, the paginated product pages). To achieve this, you'll modify your web spider to scrape product data only when the currently crawled URL points to a paginated product page.
First, you need to understand the format of the product page URLs. This will help you create a solid regex that distinguishes product page links from the others.
Open the target site in a browser and navigate to the second product page (page 2). You'll see it has the following URL, with the /page/2 path segment:
https://www.scrapingcourse.com/ecommerce/page/2
The page number in that path segment changes as you navigate the paginated product pages.
Use Python's built-in re module to match this pattern. Then, create an empty list (product_data) to collect the scraped data:
# ...
import re
# ...
# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")
# define a list to collect scraped data
product_data = []
# ... crawler function
Implement the scraping logic only if a URL's path matches the defined regex. Finally, append the extracted data to the product_data list and print it after executing the crawler:
# ...
def crawler():
# ...
while urls_to_visit and crawl_count < max_crawl:
# ...
# extract content only if the current URL matches the regex pattern
if url_pattern.search(current_url):
# get the parent element
product_containers = soup.find_all("li", class_="product")
# scrape product data
for product in product_containers:
data = {
"Url": product.find("a", class_="woocommerce-LoopProduct-link")[
"href"
],
"Image": product.find("img", class_="product-image")["src"],
"Name": product.find("h2", class_="product-name").text,
"Price": product.find("span", class_="price").text,
}
# append extracted data
product_data.append(data)
# ...
# ...
# print the extracted data
print(product_data)
Here's what the full code looks like at this point:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import re
import csv
target_url = "https://www.scrapingcourse.com/ecommerce/"
# initialize the list of discovered URLs
urls_to_visit = [target_url]
# set a maximum crawl limit
max_crawl = 20
# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")
# define a list to collect scraped data
product_data = []
def crawler():
# set a crawl counter to track the crawl depth
crawl_count = 0
while urls_to_visit and crawl_count < max_crawl:
# get the page to visit from the list
current_url = urls_to_visit.pop(0)
# request the target URL
response = requests.get(current_url)
response.raise_for_status()
# parse the HTML
soup = BeautifulSoup(response.content, "html.parser")
# collect all the links
for link_element in soup.find_all("a", href=True):
url = link_element["href"]
# convert links to absolute URLs
if not url.startswith("http"):
absolute_url = requests.compat.urljoin(target_url, url)
else:
absolute_url = url
# ensure the crawled link belongs to the target domain and hasn't been visited
if (
absolute_url.startswith(target_url)
and absolute_url not in urls_to_visit
):
urls_to_visit.append(absolute_url)
# extract content only if the current URL matches the regex page pattern
if url_pattern.search(current_url):
# get the parent element
product_containers = soup.find_all("li", class_="product")
# scrape product data
for product in product_containers:
data = {
"Url": product.find("a", class_="woocommerce-LoopProduct-link")[
"href"
],
"Image": product.find("img", class_="product-image")["src"],
"Name": product.find("h2", class_="product-name").get_text(),
"Price": product.find("span", class_="price").get_text(),
}
# append extracted data
product_data.append(data)
# update the crawl count
crawl_count += 1
# execute the crawl
crawler()
# print the extracted data
print(product_data)
The code outputs the scraped data as shown:
[
{
"Url": "https://www.scrapingcourse.com/ecommerce/product/atlas-fitness-tank/",
"Image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mt11-blue_main.jpg",
"Name": "Atlas Fitness Tank",
"Price": "$18.00",
},
# ...omitted for brevity
{
"Url": "https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/",
"Image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main.jpg",
"Name": "Zoltan Gym Tee",
"Price": "$29.00",
},
]
Your Python web crawler now scrapes data specifically from product URLs. Let's export the extracted data to a CSV in the next section.
Step 3: Extract Data into CSV
Data storage allows you to persist and share the scraped information for further analysis, referencing, and more. You can save the data in JSON, CSV, or a local or remote database. However, for simplicity, we'll use a CSV in this article.
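For instance, if you'd rather keep the output as JSON, a minimal sketch using Python's built-in json module could look like this (products.json is just an example filename):
# optional: export the scraped data as JSON instead of CSV
import json

with open("products.json", mode="w", encoding="utf-8") as file:
    json.dump(product_data, file, indent=2, ensure_ascii=False)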
To store the scraped data, import Python's built-in csv package and specify the CSV file path. Then, write each entry to a new row:
# ...
import csv
# ...
# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
writer.writeheader()
writer.writerows(product_data)
Combine all the snippets from the previous sections with this one. Here's the full code:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import re
import csv
target_url = "https://www.scrapingcourse.com/ecommerce/"
# initialize the list of discovered URLs
urls_to_visit = [target_url]
# set a maximum crawl limit
max_crawl = 20
# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")
# define a list to collect scraped data
product_data = []
def crawler():
# set a crawl counter to track the crawl depth
crawl_count = 0
while urls_to_visit and crawl_count < max_crawl:
# get the page to visit from the list
current_url = urls_to_visit.pop(0)
# request the target URL
response = requests.get(current_url)
response.raise_for_status()
# parse the HTML
soup = BeautifulSoup(response.content, "html.parser")
# collect all the links
for link_element in soup.find_all("a", href=True):
url = link_element["href"]
# convert links to absolute URLs
if not url.startswith("http"):
absolute_url = requests.compat.urljoin(target_url, url)
else:
absolute_url = url
# ensure the crawled link belongs to the target domain and hasn't been visited
if (
absolute_url.startswith(target_url)
and absolute_url not in urls_to_visit
):
urls_to_visit.append(absolute_url)
# extract content only if the current URL matches the regex page pattern
if url_pattern.search(current_url):
# get the parent element
product_containers = soup.find_all("li", class_="product")
# scrape product data
for product in product_containers:
data = {
"Url": product.find("a", class_="woocommerce-LoopProduct-link")[
"href"
],
"Image": product.find("img", class_="product-image")["src"],
"Name": product.find("h2", class_="product-name").get_text(),
"Price": product.find("span", class_="price").get_text(),
}
# append extracted data
product_data.append(data)
# update the crawl count
crawl_count += 1
# execute the crawl
crawler()
# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
writer.writeheader()
writer.writerows(product_data)
Run the script, and the products.csv file will appear in your project's folder:

That's it! You just built your first Python web crawler.
That said, you still need to optimize your crawler to make it production-ready. Let's see that next!
Optimize Your Python Web Crawler
The current web crawling script follows links and extracts specific content. However, it still needs to handle concerns like duplicate crawls, URL prioritization, and request retries more efficiently. This section explains how to optimize it.
Avoid Duplicate Links
Although the current crawler checks whether a link is already in the URL list (urls_to_visit), this approach alone doesn't filter duplicates efficiently.
The best way to prevent duplicates is to track crawled links with a data structure that enforces uniqueness, such as a set. In addition to automatically discarding duplicates, Python's set is backed by a hash table, so membership checks run in constant time on average. This makes duplicate detection significantly faster and improves overall performance.
To handle duplicates with Python's set, update the previous crawling script to track visited URLs in a set. Check if the current URL is already in the visited_urls set; if it is, skip it. Otherwise, add it to the set and continue crawling:
# ...
# create a set to track visited URLs
visited_urls = set()
def crawler():
# ...
while urls_to_visit and crawl_count < max_crawl:
# ...
if current_url in visited_urls:
continue
# add the current URL to the URL set
visited_urls.add(current_url)
# collect all the links
for link_element in soup.find_all("a", href=True):
# ...
# ensure the crawled link belongs to the target domain and hasn't been visited
if absolute_url.startswith(target_url) and absolute_url not in visited_urls:
# ...
# ... scraping logic
# ...
These improvements help your crawler efficiently avoid duplicate visits to previously discovered links.
Add a Priority Queue
Priority queues let you prioritize specific URLs over others during the crawling process.
In this case, your spider will prioritize product page URLs. This technique can even increase the amount of data you scrape within the crawl limit, as your crawler now traverses product pages before other pages.
To prioritize specific pages, you can use the Queue class from Python's built-in queue module.
Import the Queue class. Then, create high- and low-priority queues and replace the previous urls_to_visit list with them. If a discovered URL matches the product page pattern, put it in the high-priority queue. Otherwise, push it to the low-priority queue:
# ...
from queue import Queue
# ...
# instantiate the queues
high_priority_queue = Queue()
low_priority_queue = Queue()
# add the initial URL to both queues
high_priority_queue.put(target_url)
low_priority_queue.put(target_url)
# ...
def crawler():
# ...
while (
not high_priority_queue.empty() or not low_priority_queue.empty()
) and crawl_count < max_crawl:
# update the priority queue
if not high_priority_queue.empty():
current_url = high_priority_queue.get()
elif not low_priority_queue.empty():
current_url = low_priority_queue.get()
else:
break
# ...
# collect all the links
for link_element in soup.find_all("a", href=True):
# ...
# ensure the crawled link belongs to the target domain and hasn't been visited
if absolute_url.startswith(target_url) and absolute_url not in visited_urls:
# prioritize product pages
if url_pattern.search(absolute_url):
high_priority_queue.put(absolute_url)
else:
low_priority_queue.put(absolute_url)
# ... scraping logic
# ...
Great! Since you've now prioritized product page URLs, your crawler has a higher chance of reaching more of those pages.
Apply Request Retries and Delays
Request retries allow you to check and resend failed requests. While implementing retries, adding random or exponential delays is a good practice to avoid overloading the server or risking an IP ban. An exponential delay increases the wait time exponentially for each failed request.
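To see how that plays out before reaching for a library, here's a minimal hand-rolled sketch; the attempt count and base delay are arbitrary example values, and the crawler below uses Tenacity instead:
# illustrative retry loop with exponential backoff (the crawler below uses Tenacity)
import time
import requests

def fetch_with_backoff(url, max_attempts=4, base_delay=1):
    for attempt in range(max_attempts):
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # wait 1s, 2s, 4s, ... before retrying
            time.sleep(base_delay * 2**attempt)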
The easiest way to implement retry logic with exponential backoff is to use Tenacity, a Python library for handling retries. Install it using pip:
pip3 install tenacity
Next, import the library and add it as a retry decorator in a new fetch_url function. The retry decorator lets you specify the maximum number of retries and backoff parameters. Since the fetch_url function now handles requests, ensure you use it in place of the previous request code block:
# pip3 install requests beautifulsoup4 tenacity
# ...
from tenacity import retry, stop_after_attempt, wait_exponential
# ...
# implement a request retry mechanism
@retry(
stop=stop_after_attempt(4), # maximum number of retries
wait=wait_exponential(multiplier=5, min=4, max=5), # exponential backoff
)
def fetch_url(url):
response = requests.get(url)
response.raise_for_status()
return response
def crawler():
# ...
while (not high_priority_queue.empty() or not low_priority_queue.empty()) and crawl_count < max_crawl:
# ...
# request the target URL
response = fetch_url(current_url)
# ...
# ...
With the above modifications, your crawler will now retry failed requests automatically.
Maintain a Single Crawl Session
Sessions let you reuse connections, reducing redundant TCP handshakes, and persist headers and cookies across requests. This reduces network overhead and improves overall performance.
You can implement a session with the Session object from the Requests library. Create a new session instance and use it to update the fetch_url function like so:
# ...
# initialize session
session = requests.Session()
# ...
def fetch_url(url):
response = session.get(url)
# ...
Splendid! Your Python web crawler now manages HTTP requests with a session.
Combine the snippets from each section to see what the final code looks like:
# pip3 install requests beautifulsoup4 tenacity
import requests
from bs4 import BeautifulSoup
import re
import csv
from tenacity import retry, stop_after_attempt, wait_exponential
from queue import Queue
target_url = "https://www.scrapingcourse.com/ecommerce/"
# instantiate the queues
high_priority_queue = Queue()
low_priority_queue = Queue()
# add the initial URL to both queues
high_priority_queue.put(target_url)
low_priority_queue.put(target_url)
# create a set to track visited URLs
visited_urls = set()
# set a maximum crawl limit
max_crawl = 20
# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")
# define a list to collect scraped data
product_data = []
# initialize session
session = requests.Session()
# implement a request retry
@retry(
stop=stop_after_attempt(4), # maximum number of retries
wait=wait_exponential(multiplier=5, min=4, max=5), # exponential backoff
)
def fetch_url(url):
response = session.get(url)
response.raise_for_status()
return response
def crawler():
# set a crawl counter to track the crawl depth
crawl_count = 0
while (
not high_priority_queue.empty() or not low_priority_queue.empty()
) and crawl_count < max_crawl:
# update the priority queue
if not high_priority_queue.empty():
current_url = high_priority_queue.get()
elif not low_priority_queue.empty():
current_url = low_priority_queue.get()
else:
break
if current_url in visited_urls:
continue
# add the current URL to the URL set
visited_urls.add(current_url)
# request the target URL
response = fetch_url(current_url)
# parse the HTML
soup = BeautifulSoup(response.content, "html.parser")
# collect all the links
for link_element in soup.find_all("a", href=True):
url = link_element["href"]
# convert links to absolute URLs
if not url.startswith("http"):
absolute_url = requests.compat.urljoin(target_url, url)
else:
absolute_url = url
# ensure the crawled link belongs to the target domain and hasn't been visited
if absolute_url.startswith(target_url) and absolute_url not in visited_urls:
# prioritize product pages
if url_pattern.search(absolute_url):
high_priority_queue.put(absolute_url)
else:
low_priority_queue.put(absolute_url)
# extract content only if the current URL matches the regex page pattern
if url_pattern.search(current_url):
# get the parent element
product_containers = soup.find_all("li", class_="product")
# scrape product data
for product in product_containers:
data = {
"Url": product.find("a", class_="woocommerce-LoopProduct-link")[
"href"
],
"Image": product.find("img", class_="product-image")["src"],
"Name": product.find("h2", class_="product-name").get_text(),
"Price": product.find("span", class_="price").get_text(),
}
# append extracted data
product_data.append(data)
# update the crawl count
crawl_count += 1
# execute the crawl
crawler()
# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
writer.writeheader()
writer.writerows(product_data)
Bravo! You just built a robust web crawler in Python. However, the work doesn't stop here: you still need to deal with edge cases like advanced anti-bot measures, parallel crawling, JavaScript rendering, and more.
We'll explore a solution that can handle all these in the next section.
Avoid Getting Blocked While Crawling With Python
The biggest challenge of web crawling in Python is getting blocked. Many sites use anti-bot measures to identify and stop automated access, preventing you from getting your desired data.
To avoid getting blocked during web crawling, you need to apply mitigation measures. Some strategies include setting a custom User Agent, crawling during off-peak hours, and using proxies. However, these tips generally work only in simple scenarios and fall short against advanced anti-bot measures.
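As an illustration, here's a minimal sketch of the User-Agent and proxy tips with Requests; the User-Agent string is just an example, and the proxy address is a placeholder you'd replace with your own:
# minimal sketch: custom User-Agent and proxies with Requests (placeholder proxy)
import requests

headers = {
    # example desktop Chrome User-Agent string
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
}
proxies = {
    "http": "http://<PROXY_HOST>:<PROXY_PORT>",
    "https": "http://<PROXY_HOST>:<PROXY_PORT>",
}

response = requests.get(
    "https://www.scrapingcourse.com/ecommerce/", headers=headers, proxies=proxies
)
print(response.status_code)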
The easiest and most reliable way to bypass anti-bot measures is to use a web scraping API, such as ZenRows' Universal Scraper API.
ZenRows provides a seamless crawling experience with a single API call. It offers premium rotating proxies to prevent IP bans and cookie support for persisting connections across multiple requests. With support for JavaScript rendering and advanced anti-bot auto-bypass, you can simulate human interactions on the go and forget about getting blocked even while crawling the most protected websites.
Let's see how ZenRows works by scraping the Anti-bot Challenge page, a heavily protected website.
Sign up on ZenRows to open the Request Builder. Paste your target URL in the link box and activate Premium Proxies and JS Rendering.

Then, select Python as your programming language and choose the API connection mode. Copy the generated Python code and paste it into your scraper.
The generated Python code should look like this:
# pip install requests
import requests
url = "https://www.scrapingcourse.com/antibot-challenge"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
"url": url,
"apikey": apikey,
"js_render": "true",
"premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
The above code bypasses the anti-bot challenge and outputs the protected site's full-page HTML:
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! 🎉 You just bypassed an anti-bot measure with ZenRows' Universal Scraper API.
Web Crawling Tools for Python
There are several helpful web crawling tools to make discovering links and visiting pages easier. Here's a list of some of the best Python web crawling tools that can assist you:
- ZenRows: A comprehensive scraping and crawling solution for bypassing CAPTCHAs and other anti-bots. It features rotating proxies, geo-localization, JavaScript rendering and advanced AI anti-bot auto-bypass.
- Scrapy: One of the most powerful Python crawling frameworks, yet approachable for beginners. It provides a high-level API for building scalable and efficient crawlers (see the short spider sketch after this list).
- Selenium: A popular headless browser automation library for web scraping and crawling. It supports JavaScript execution and can interact with web pages in a browser environment like human users would.
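To give you a feel for Scrapy, here's a minimal spider sketch for the same demo site; the spider name and the pagination selector (a.page-numbers) are assumptions for illustration, while the product selectors mirror the ones used earlier in this article:
# minimal Scrapy spider sketch; run with: scrapy runspider product_spider.py -o products.json
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"  # illustrative spider name
    start_urls = ["https://www.scrapingcourse.com/ecommerce/"]

    def parse(self, response):
        # yield product names and prices from the current listing page
        for product in response.css("li.product"):
            yield {
                "Name": product.css("h2.product-name::text").get(),
                "Price": "".join(product.css("span.price ::text").getall()).strip(),
            }
        # follow pagination links with the same callback
        for href in response.css("a.page-numbers::attr(href)").getall():
            yield response.follow(href, callback=self.parse)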
Another important aspect of web crawling you should pay attention to is best practices. You'll learn them in the next section.
Web Crawling Best Practices and Considerations in Python
The Python crawling best practices below will help you build a more robust script.
Crawling JavaScript-Rendered Web Pages in Python
Crawling pages that rely on JavaScript for rendering or data retrieval can be challenging because traditional libraries like Requests and Beautiful Soup can't execute JavaScript or perform dynamic interactions.
The easiest way to crawl dynamic websites in Python is to use headless browser automation tools like Selenium or Playwright. These libraries allow you to control web browsers programmatically and simulate user interactions like clicking, scrolling, and more. They even allow you to crawl React applications.
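For example, here's a minimal Selenium sketch that renders a page in headless Chrome and hands the resulting HTML to Beautiful Soup; it assumes Chrome is installed locally and that a recent Selenium version downloads the driver automatically:
# pip3 install selenium beautifulsoup4
# minimal sketch: render a page with headless Chrome, then parse the HTML
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://www.scrapingcourse.com/ecommerce/")
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.get_text())

driver.quit()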
Parallel Scraping in Python With Concurrency
So far, the crawler has handled one page at a time. After sending an HTTP request, it sits idle, waiting for the server's response before proceeding. This approach is inefficient and increases crawling time.
To improve performance, introduce multi-threading to process multiple pages concurrently. Each thread fetches and parses a page independently, significantly speeding up the crawl. For thread safety, ensure that only one thread at a time can update the shared crawl data by using synchronization or locking.
Fortunately, Python's Queue is thread-safe, meaning multiple threads can read from and write to it without causing race conditions. A race condition occurs when threads try to modify shared data simultaneously. Since Queue handles synchronization internally, if one thread is accessing a resource, others must wait their turn, ensuring data integrity and preventing conflicts.
Add the threading module to your imports and use it to create a thread lock for tracking and updating the visited_urls set. Then, run the crawler function in multiple worker threads. After these changes, the final crawler code looks like the following:
# pip3 install requests beautifulsoup4 tenacity
import requests
from bs4 import BeautifulSoup
import re
import csv
from queue import Queue
from tenacity import retry, stop_after_attempt, wait_exponential
import threading
target_url = "https://www.scrapingcourse.com/ecommerce/"
# instantiate the queues
high_priority_queue = Queue()
low_priority_queue = Queue()
# add the initial URL to the queues
high_priority_queue.put(target_url)
low_priority_queue.put(target_url)
# define a thread-safe set for visited URLs
visited_urls = set()
visited_lock = threading.Lock()
# set maximum crawl limit
max_crawl = 20
# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")
# list to store scraped data
product_data = []
# activate thread lock to prevent race conditions
data_lock = threading.Lock()
# initialize session
session = requests.Session()
# implement a request retry
@retry(
stop=stop_after_attempt(4), # max retries
wait=wait_exponential(multiplier=5, min=4, max=5), # exponential backoff
)
def fetch_url(url):
response = session.get(url)
response.raise_for_status()
return response
def crawler():
# set a crawl counter to track the crawl depth
crawl_count = 0
while (
not high_priority_queue.empty() or not low_priority_queue.empty()
) and crawl_count < max_crawl:
# update the priority queue
if not high_priority_queue.empty():
current_url = high_priority_queue.get()
elif not low_priority_queue.empty():
current_url = low_priority_queue.get()
else:
break
with visited_lock:
if current_url in visited_urls:
continue
# add the current URL to the URL set
visited_urls.add(current_url)
# request the target URL
response = fetch_url(current_url)
# parse the HTML
soup = BeautifulSoup(response.content, "html.parser")
# collect all the links
for link_element in soup.find_all("a", href=True):
url = link_element["href"]
# check if the URL is absolute or relative
if not url.startswith("http"):
absolute_url = requests.compat.urljoin(target_url, url)
else:
absolute_url = url
with visited_lock:
# ensure the crawled link belongs to the target domain and hasn't been visited
if (
absolute_url.startswith(target_url)
and absolute_url not in visited_urls
):
# prioritize product pages
if url_pattern.search(absolute_url):
high_priority_queue.put(absolute_url)
else:
low_priority_queue.put(absolute_url)
# extract content only if the current URL matches the regex page pattern
if url_pattern.search(current_url):
# get the parent element
product_containers = soup.find_all("li", class_="product")
# scrape product data
for product in product_containers:
data = {
"Url": product.find("a", class_="woocommerce-LoopProduct-link")[
"href"
],
"Image": product.find("img", class_="product-image")["src"],
"Name": product.find("h2", class_="product-name").get_text(),
"Price": product.find("span", class_="price").get_text(),
}
with data_lock:
# append extracted data
product_data.append(data)
# update the crawl count
crawl_count += 1
# specify the number of threads to use
num_workers = 4
threads = []
# start worker threads
for _ in range(num_workers):
thread = threading.Thread(target=crawler, daemon=True)
threads.append(thread)
thread.start()
# wait for all threads to finish
for thread in threads:
thread.join()
# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
writer.writeheader()
writer.writerows(product_data)
Be careful when running the code: a large num_workers value starts many requests at a time, increasing the chances of getting blocked. On the flip side, a higher num_workers value often reduces execution time.
Take a look at the benchmarks below to verify the performance results:
- Sequential requests: 29.32s.
- Queue with one worker (num_workers = 1): 29.41s.
- Queue with two workers (num_workers = 2): 20.05s.
- Queue with five workers (num_workers = 5): 11.97s.
- Queue with ten workers (num_workers = 10): 12.02s.
There's almost no difference between sequential requests and a single worker, but the threading overhead pays off once you add multiple workers. Find out why in our tutorial on web scraping with concurrency in Python.
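If you'd like to reproduce these numbers on your machine, a minimal sketch is to wrap the thread start/join block from the script above with time.perf_counter; the exact figures will vary with your network and num_workers:
# minimal timing sketch around the worker threads from the script above
import time

start = time.perf_counter()

# ... start the worker threads and join them here (as in the previous script) ...

elapsed = time.perf_counter() - start
print(f"crawled {len(visited_urls)} pages with {num_workers} workers in {elapsed:.2f}s")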
Distributed Web Scraping in Python
Distributed crawling involves spreading the crawling job across several servers. While this technique can help you achieve maximum scalability and improve performance, it's a complex process.
Fortunately, you can achieve distributed web crawling in Python using libraries like Celery or Redis Queue.
For a deeper dive, check out our guide on distributed web crawling: https://www.zenrows.com/blog/distributed-web-crawling
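As a rough sketch of the idea, the snippet below turns page fetching into a Celery task that workers on different machines can pull from a shared broker; the Redis URL is a placeholder, and feeding discovered links back into a shared frontier is left to your architecture:
# pip3 install celery redis
# minimal distributed-crawling sketch with Celery (placeholder broker URL)
import requests
from celery import Celery

app = Celery("crawler", broker="redis://localhost:6379/0")

@app.task
def crawl_page(url):
    # each worker machine pulls URLs from the shared queue and fetches them
    response = requests.get(url)
    response.raise_for_status()
    return response.text

# enqueue a URL from any producer
# start workers with: celery -A tasks worker (assuming this file is saved as tasks.py)
crawl_page.delay("https://www.scrapingcourse.com/ecommerce/")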
Persistence
Data persistence is helpful for scalability and maintenance. In a real-world crawling scenario, you should store:
- The discovered URLs with their timestamps.
- Status codes with timestamps to reschedule failed requests.
- The scraped content.
- The HTML documents for later processing.
You should export this information to files and/or store it in a database.
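A lightweight starting point is Python's built-in sqlite3 module. Here's a minimal sketch that records each crawled URL with its status code, a timestamp, and the raw HTML; the table layout is only an example:
# minimal persistence sketch with sqlite3 (example schema)
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS crawled_pages (
        url TEXT PRIMARY KEY,
        status_code INTEGER,
        crawled_at TEXT DEFAULT CURRENT_TIMESTAMP,
        html TEXT
    )"""
)

def save_page(url, status_code, html):
    # upsert so a re-crawl refreshes the stored copy
    conn.execute(
        "INSERT OR REPLACE INTO crawled_pages (url, status_code, html) VALUES (?, ?, ?)",
        (url, status_code, html),
    )
    conn.commit()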
Canonicals to Avoid Duplicate URLs
The current crawling process doesn't account for canonical links, which website owners often use to indicate that multiple URLs represent the same page.
To avoid redundant crawling, check whether the page declares a canonical URL in its <link rel="canonical"> tag. If it does, treat that canonical URL as the official one and add it to the visited set instead of the requested URL. This prevents duplicate processing when the same page is reachable from different links.
Additionally, a single page can have multiple URLs due to query parameters (e.g., ?ref=123) or hash fragments (e.g., #section1). Without normalization, the crawler may treat these variations as separate pages and crawl them multiple times. While you've used a URL resolver in this tutorial, you can also strip unnecessary query parameters (for example, with w3lib's url_query_cleaner) to standardize URLs before checking for duplicates.
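Here's a minimal sketch of both ideas: prefer the canonical URL when the page declares one, and strip query strings and fragments before checking for duplicates. Whether dropping every query parameter is safe depends on the site, so treat the normalization rule as an example:
# minimal sketch: canonical URL detection and URL normalization
from urllib.parse import urlsplit, urlunsplit
from bs4 import BeautifulSoup

def canonical_or_self(soup, current_url):
    # prefer the canonical link if the page declares one
    canonical = soup.find("link", rel="canonical")
    if canonical and canonical.get("href"):
        return canonical["href"]
    return current_url

def normalize_url(url):
    # drop query strings and hash fragments so URL variants dedupe to one entry
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))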
Conclusion
In this guide, you learned the fundamentals of web crawling. You started with the basics and then moved on to more advanced topics to become a Python crawling expert!
Now you know:
- What a web crawler is.
- How to build a crawler in Python.
- What to do to make your script production-ready.
- The best Python crawling libraries.
- The Python web crawling best practices and advanced techniques.
Regardless of how robust your crawler is, anti-bot measures can detect and block it. You can eliminate any challenge with ZenRows, an all-in-one web scraping solution for crawling at scale without limitations.
Frequent Questions
How Do I Create a Web Crawler in Python?
To create a web crawler in Python, first define the initial URL and maintain a set of visited URLs. You can then use libraries such as Requests or Scrapy to send HTTP requests and retrieve HTML content. The crawler extracts relevant information from the HTML. Finally, repeat the process by following the links discovered on the pages.
Can Python Be Used for a Web Crawler?
Yes, Python is widely used for web crawling thanks to its rich ecosystem of libraries and tools. It offers libraries like Requests, Beautiful Soup, and Scrapy, simplifying data extraction at scale.
What Is a Web Crawler Used for?
You can use a web crawler to browse and collect information from sites systematically. Crawlers automate fetching web pages and following links to discover new web content. They are popular for web indexing, content aggregation, and URL discovery.
How Do You Crawl Data from a Website in Python?
To crawl data from a website in Python, send an HTTP request to the desired URL using a library like Requests. Retrieve the HTML response, parse it with a library such as Beautiful Soup, and then apply extraction techniques such as CSS selectors or XPath expressions to find and extract the desired elements.
What Are the Different Ways to Crawl Web Data in Python?
There are several ways to crawl web data in Python, depending on the target site's complexity and the project's size. Use libraries like Requests and Beautiful Soup for basic crawling and scraping tasks. You might prefer complete crawling frameworks like Scrapy for more advanced functionality and flexibility. Try ZenRows with a single API call to significantly reduce the overall complexity and bypass all anti-bot measures.