How to Build a Web Crawler in Python

Sergio Nonide
Updated: March 3, 2025 · 10 min read

Building a scalable web crawler is no small task. Crawling at scale presents several challenges, including traversing multiple pages and optimizing performance.

But no worries. We'll guide you through building your own efficient Python web crawler that scales effortlessly.

What Is a Web Crawler in Python?

A web crawler navigates pages, discovers links, and follows them, usually to extract specific information. You can think of it as the engine behind a search engine: it systematically explores the web to find and index pages.

Web crawling and web scraping share some similarities, but their operations differ. While a web crawler discovers and indexes many web pages, a web scraper extracts and stores specific data from those pages.

If you want to create a production-ready web crawler, you'll follow a few steps that let you add new URLs to your crawl queue continuously. That said, you also want your web crawler to avoid common problems like link duplication, endless crawling, latency, IP bans, etc.

To learn more, check out our in-depth guide on Web Scraping vs. Web Crawling.

Now that you know what a web crawler is, it's time to build one.

Build Your First Python Web Crawler

We'll now crawl this E-commerce Challenge website. The site has many pages, including links to paginated products, shopping carts, category pages, and more. Your web crawler will follow some of these links and extract specific product details, including product names, prices, URLs, and image links.

See what the page looks like below:

ScrapingCourse.com Ecommerce homepage

Before jumping into the tutorial, let's start with the requirements.

Prerequisites for Python Web Crawling

You'll need Python 3 installed on your machine to get started.

You'll also need an HTTP client and an HTML parser for web crawling. The two most popular Python packages for that are:

  • Requests: A powerful HTTP client library that simplifies sending HTTP requests and handling their responses.
  • Beautiful Soup: A full-featured HTML and XML parser that exposes a complete API to explore the DOM, select HTML elements, and retrieve data from them.

Install them both using the following command:

Terminal
pip3 install beautifulsoup4 requests

Great! You're ready to begin.

The following sections provide a step-by-step guide to building your first web crawler in Python.

Step 1: Follow All the Links on the Website

In this step, you'll design your crawling logic to discover and follow all the links on the target website. But first, you'll make a simple request to the target website.

Create a crawler function with a request to the target page. Behind the scenes, the get method performs an HTTP GET request to the specified URL and returns its full-page HTML:

crawler.py
# pip3 install requests
import requests

# request the target URL
def crawler():
    response = requests.get("https://www.scrapingcourse.com/ecommerce/")
    response.raise_for_status()
    print(response.text)

# execute the crawler
crawler()

The above is a simple GET request to the target website that only returns its HTML as a response:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
  
    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
    
  <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4">

        <!-- ... -->

    </ul>
</body>
</html>

Let's now turn the current code into a full-blown web crawler.

Add Beautiful Soup to your imports and set the target URL as the initial link in the urls_to_visit list.

crawler.py
# pip3 install requests beautifulsoup4

# ...
from bs4 import BeautifulSoup

target_url = "https://www.scrapingcourse.com/ecommerce/"

# initialize the list of discovered URLs
urls_to_visit = [target_url]

Modify the crawler function to run continuously with a while loop as long as urls_to_visit isn't empty. Get the next link from the front of the URL list and parse the website's HTML with Beautiful Soup. Extract all href attributes from a tags and convert them to absolute URLs. Then, update the urls_to_visit list with newly discovered links.

Here's the modified code:

crawler.py
# ...

def crawler():

    while urls_to_visit:

        # get the page to visit from the list
        current_url = urls_to_visit.pop(0)

        # ...

        # parse the HTML
        soup = BeautifulSoup(response.text, "html.parser")

        # collect all the links
        link_elements = soup.select("a[href]")
        for link_element in link_elements:
            url = link_element["href"]

            # convert links to absolute URLs
            if not url.startswith("http"):
                absolute_url = requests.compat.urljoin(target_url, url)
            else:
                absolute_url = url

            # ensure the crawled link belongs to the target domain and hasn't been visited
            if (
                absolute_url.startswith(target_url)
                and absolute_url not in urls_to_visit
            ):
                urls_to_visit.append(absolute_url)

The code might run an endless crawl at this point because it doesn't specify a depth limit. Setting a depth limit ensures the crawler stops after reaching a predefined crawl depth, preventing excessive or infinite crawling.

To apply the limit, set max_crawl to 20. Then, use a crawl counter to track the crawl depth. Update the while loop logic with these changes as shown:

crawler.py
# set a maximum crawl limit
max_crawl = 20


def crawler():
    # set a crawl counter to track the crawl depth
    crawl_count = 0

    while urls_to_visit and crawl_count < max_crawl:

        # ...

        # ... crawling logic

        # update the crawl counter
        crawl_count += 1

Finally, print the urls_to_visit list and execute the crawler function:

crawler.py
# ...
def crawler():
    # ...

    # print the crawled URLs
    print(urls_to_visit)

# execute the crawl
crawler()

Combine the snippets at this point, and you'll get the following complete and updated code:

crawler.py
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

target_url = "https://www.scrapingcourse.com/ecommerce/"

# initialize the list of discovered URLs
urls_to_visit = [target_url]

# set a maximum crawl limit
max_crawl = 20

def crawler():
    # set a crawl counter to track the crawl depth
    crawl_count = 0

    while urls_to_visit and crawl_count < max_crawl:

        # get the page to visit from the list
        current_url = urls_to_visit.pop(0)

        # request the target URL
        response = requests.get(current_url)
        response.raise_for_status()
        # parse the HTML
        soup = BeautifulSoup(response.text, "html.parser")

        # collect all the links
        link_elements = soup.select("a[href]")
        for link_element in link_elements:
            url = link_element["href"]

            # convert links to absolute URLs
            if not url.startswith("http"):
                absolute_url = requests.compat.urljoin(target_url, url)
            else:
                absolute_url = url

            # ensure the crawled link belongs to the target domain and hasn't been visited
            if (
                absolute_url.startswith(target_url)
                and absolute_url not in urls_to_visit
            ):
                urls_to_visit.append(absolute_url)

        # update the crawl count
        crawl_count += 1

    # print the crawled URLs
    print(urls_to_visit)

# execute the crawl
crawler()

The above code outputs all the URLs for the first 20 crawls:

Output
[
    'https://www.scrapingcourse.com/ecommerce/?add-to-cart=2740',
    'https://www.scrapingcourse.com/ecommerce/product/ajax-full-zip-sweatshirt/',

    # ... omitted for brevity

    'https://www.scrapingcourse.com/ecommerce/page/1/',
    'https://www.scrapingcourse.com/ecommerce/page/5/',

    # ... omitted for brevity
]

Great! It's time to filter the above links and extract data from specific product pages.

Step 2: Extract Data From the Crawler

Once you have the links, the next step is to extract data from selected pages (specifically product pages). To achieve this, you'll modify your web spider to scrape product data only when the currently crawled URL points to a product page.

First, you need to understand the format of the product page URL. This will help you create a solid URL regex that distinguishes product links from others.

Open the target site in a browser and navigate to the second product page (page 2). You'll see it has the following URL, with the path segment /page/2/:

Example
https://www.scrapingcourse.com/ecommerce/page/2/

The page number in this path segment changes as you navigate the paginated product pages.

Use Python's re module to match this path pattern. Then, create an empty list (product_data) to collect the scraped data:

crawler.py
# ...

import re

# ...

# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")

# define a list to collect scraped data
product_data = []

# ... crawler function
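
Before wiring the pattern into the crawler, you can sanity-check which URLs it matches. Note that the trailing slash matters: the pattern matches paginated listing URLs like /page/2/ but not the homepage or cart links. The URLs below are illustrative examples based on the target site's links:

Example
# quick sanity check of the URL pattern (illustrative URLs)
import re

url_pattern = re.compile(r"/page/\d+/")

print(bool(url_pattern.search("https://www.scrapingcourse.com/ecommerce/page/2/")))  # True
print(bool(url_pattern.search("https://www.scrapingcourse.com/ecommerce/")))  # False
print(bool(url_pattern.search("https://www.scrapingcourse.com/ecommerce/cart/")))  # False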

Implement the scraping logic only if a URL's path matches the defined regex. Finally, append the extracted data to the product_data list and print it after executing the crawler:

crawler.py
# ...

def crawler():
    # ...

    while urls_to_visit and crawl_count < max_crawl:

        # ...

        # extract content only if the current URL matches the regex pattern
        if url_pattern.search(current_url):
            # get the parent element
            product_containers = soup.find_all("li", class_="product")

            # scrape product data
            for product in product_containers:
                data = {
                    "Url": product.find("a", class_="woocommerce-LoopProduct-link")[
                        "href"
                    ],
                    "Image": product.find("img", class_="product-image")["src"],
                    "Name": product.find("h2", class_="product-name").text,
                    "Price": product.find("span", class_="price").text,
                }

                # append extracted data
                product_data.append(data)

            # ...
# ...

# print the extracted data
print(product_data)

Here's what the full code looks like at this point:

crawler.py
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import re

target_url = "https://www.scrapingcourse.com/ecommerce/"

# initialize the list of discovered URLs
urls_to_visit = [target_url]

# set a maximum crawl limit
max_crawl = 20

# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")

# define a list to collect scraped data
product_data = []


def crawler():
    # set a crawl counter to track the crawl depth
    crawl_count = 0

    while urls_to_visit and crawl_count < max_crawl:

        # get the page to visit from the list
        current_url = urls_to_visit.pop(0)

        # request the target URL
        response = requests.get(current_url)
        response.raise_for_status()

        # parse the HTML
        soup = BeautifulSoup(response.content, "html.parser")

        # collect all the links
        for link_element in soup.find_all("a", href=True):
            url = link_element["href"]

            # convert links to absolute URLs
            if not url.startswith("http"):
                absolute_url = requests.compat.urljoin(target_url, url)
            else:
                absolute_url = url

            # ensure the crawled link belongs to the target domain and hasn't been visited
            if (
                absolute_url.startswith(target_url)
                and absolute_url not in urls_to_visit
            ):
                urls_to_visit.append(absolute_url)

        # extract content only if the current URL matches the regex page pattern
        if url_pattern.search(current_url):
            # get the parent element
            product_containers = soup.find_all("li", class_="product")

            # scrape product data
            for product in product_containers:
                data = {
                    "Url": product.find("a", class_="woocommerce-LoopProduct-link")[
                        "href"
                    ],
                    "Image": product.find("img", class_="product-image")["src"],
                    "Name": product.find("h2", class_="product-name").get_text(),
                    "Price": product.find("span", class_="price").get_text(),
                }

                # append extracted data
                product_data.append(data)

        # update the crawl count
        crawl_count += 1


# execute the crawl
crawler()

# print the extracted data
print(product_data)

The code outputs the scraped data as shown:

Output
[
    {
        "Url": "https://www.scrapingcourse.com/ecommerce/product/atlas-fitness-tank/",
        "Image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mt11-blue_main.jpg",
        "Name": "Atlas Fitness Tank",
        "Price": "$18.00",
    },
    # ...omitted for brevity
    {
        "Url": "https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/",
        "Image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main.jpg",
        "Name": "Zoltan Gym Tee",
        "Price": "$29.00",
    },
]

Your Python web crawler now scrapes data specifically from product URLs. Let's export the extracted data to a CSV in the next section.

Step 3: Export the Data to CSV

Data storage allows you to persist and share the scraped information for further analysis, referencing, and more. You can save the data in JSON, CSV, or a local or remote database. However, for simplicity, we'll use a CSV in this article.

To store the scraped data, import Python's built-in csv package and specify the CSV file path. Then, write each entry to a new row:

crawler.py
# ...
import csv

# ...

# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
    writer.writeheader()
    writer.writerows(product_data)

Combine all the snippets from the previous sections with this one. Here's the full code:

crawler.py
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import re
import csv

target_url = "https://www.scrapingcourse.com/ecommerce/"

# initialize the list of discovered URLs
urls_to_visit = [target_url]

# set a maximum crawl limit
max_crawl = 20

# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")

# define a list to collect scraped data
product_data = []

def crawler():
    # set a crawl counter to track the crawl depth
    crawl_count = 0
 
    while urls_to_visit and crawl_count < max_crawl:

        # get the page to visit from the list
        current_url = urls_to_visit.pop(0)

        # request the target URL
        response = requests.get(current_url)
        response.raise_for_status()

        # parse the HTML
        soup = BeautifulSoup(response.content, "html.parser")

        # collect all the links
        for link_element in soup.find_all("a", href=True):
            url = link_element["href"]

            # convert links to absolute URLs
            if not url.startswith("http"):
                absolute_url = requests.compat.urljoin(target_url, url)
            else:
                absolute_url = url

            # ensure the crawled link belongs to the target domain and hasn't been visited
            if (
                absolute_url.startswith(target_url)
                and absolute_url not in urls_to_visit
            ):
                urls_to_visit.append(absolute_url)

        # extract content only if the current URL matches the regex page pattern
        if url_pattern.search(current_url):
            # get the parent element
            product_containers = soup.find_all("li", class_="product")

            # scrape product data
            for product in product_containers:
                data = {
                    "Url": product.find("a", class_="woocommerce-LoopProduct-link")[
                        "href"
                    ],
                    "Image": product.find("img", class_="product-image")["src"],
                    "Name": product.find("h2", class_="product-name").get_text(),
                    "Price": product.find("span", class_="price").get_text(),
                }

                # append extracted data
                product_data.append(data)

        # update the crawl count
        crawl_count += 1

# execute the crawl
crawler()

# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
    writer.writeheader()
    writer.writerows(product_data)

Run the script, and the products.csv file will appear in your project's folder:

scrapingcourse ecommerce product output csv

That's it! You just built your first Python web crawler.

That said, you still need to optimize your crawler to make it production-ready. Let's see that next!

Optimize Your Python Web Crawler

The current web crawling script follows links and extracts specific content. However, it still needs to handle concerns like duplicate crawls, prioritization, and retries more efficiently. This section explains how to optimize it.

Avoid Duplicate Crawls With a Set

Although the current crawler checks whether a newly discovered link is already in the URL list (urls_to_visit), this method alone doesn't filter duplicates efficiently.

The best way to prevent duplicates is to track crawled links in a data structure that discards them automatically, such as a set. Python's set is backed by a hash table, so membership checks run in constant time on average. That means much faster duplicate detection, which matters more and more as the list of discovered URLs grows.
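
If you want to see the difference yourself, here's a quick, self-contained comparison. The URLs are synthetic, and the exact timings will vary by machine:

Example
# membership check: list (linear scan) vs. set (hash lookup)
import timeit

urls_list = [f"https://example.com/page/{i}/" for i in range(100_000)]
urls_set = set(urls_list)
lookup = "https://example.com/page/99999/"

print(timeit.timeit(lambda: lookup in urls_list, number=100))  # slow: scans the whole list
print(timeit.timeit(lambda: lookup in urls_set, number=100))   # fast: one hash lookup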

To handle duplicates with a set, update the previous crawling script to track visited URLs in a visited_urls set. Check whether the current URL is already in the set. If it is, skip it; otherwise, add it to the set and continue crawling:

crawler.py
# ...

# create a set to track visited URLs
visited_urls = set()

def crawler():
    # ...

    while urls_to_visit and crawl_count < max_crawl:

        # ...

        if current_url in visited_urls:
            continue

        # add the current URL to the URL set
        visited_urls.add(current_url)

        # collect all the links
        for link_element in soup.find_all("a", href=True):
            # ...

            # ensure the crawled link belongs to the target domain and hasn't been visited
            if absolute_url.startswith(target_url) and absolute_url not in visited_urls:
                # ...

        # ... scraping logic

# ...

These improvements help your crawler efficiently avoid duplicate visits to previously discovered links.

Add a Priority Queue

Priority queues let you prioritize specific URLs over others during the crawling process.

In this case, your spider will prioritize product page URLs. This technique can even increase the amount of data you scrape, as your crawler traverses product pages before other pages.

To prioritize specific pages, you can use the Queue class from Python's built-in queue module and maintain two queues: one for high-priority URLs and one for everything else.

Import the Queue class. Then, create high- and low-priority queues and replace the previous urls_to_visit list with them. If a discovered link matches the product page pattern, put it in the high-priority queue. Otherwise, push it to the low-priority queue:

crawler.py
# ...
from queue import Queue

# ...

# instantiate the queues
high_priority_queue = Queue()
low_priority_queue = Queue()

# add the initial URL to the queues
high_priority_queue.put(target_url)
low_priority_queue.put(target_url)

# ...

def crawler():
    # ...

    while (
        not high_priority_queue.empty() or not low_priority_queue.empty()
    ) and crawl_count < max_crawl:

        # update the priority queue
        if not high_priority_queue.empty():
            current_url = high_priority_queue.get()
        elif not low_priority_queue.empty():
            current_url = low_priority_queue.get()
        else:
           break

        # ...

        # collect all the links
        for link_element in soup.find_all("a", href=True):
            # ...

            # ensure the crawled link belongs to the target domain and hasn't been visited
            if absolute_url.startswith(target_url) and absolute_url not in visited_urls:
                # prioritize product pages
                if url_pattern.search(absolute_url):
                    high_priority_queue.put(absolute_url)
                else:
                    low_priority_queue.put(absolute_url)

        # ... scraping logic
# ...

Great! Since you've now prioritized product page URLs, your crawler has a higher chance of reaching more of those pages.
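
As an aside, you could achieve a similar effect with a single PriorityQueue instead of two separate queues. Here's a minimal sketch of that alternative; the numeric priorities are an assumption for illustration, and lower values are dequeued first:

Example
# alternative: one PriorityQueue instead of two FIFO queues
import re
from queue import PriorityQueue

url_pattern = re.compile(r"/page/\d+/")
url_queue = PriorityQueue()

def enqueue(absolute_url):
    # product listing pages get priority 0 and jump ahead of everything else
    priority = 0 if url_pattern.search(absolute_url) else 1
    url_queue.put((priority, absolute_url))

enqueue("https://www.scrapingcourse.com/ecommerce/cart/")
enqueue("https://www.scrapingcourse.com/ecommerce/page/2/")

# the product listing page comes out first despite being added last
priority, current_url = url_queue.get()
print(current_url)  # https://www.scrapingcourse.com/ecommerce/page/2/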

Apply Request Retries and Delays

Request retries let you automatically resend failed requests. While implementing retries, it's good practice to add random or exponential delays to avoid overloading the server or risking an IP ban. An exponential delay increases the wait time after each failed attempt.

The easiest way to implement a retry logic with exponential backoff is to use Tenacity, a Python library for handling retries. Install it using pip:

Terminal
pip3 install tenacity

Next, import the library and add it as a retry decorator in a new fetch_url function. The retry decorator lets you specify the maximum number of retries and backoff parameters. Since the fetch_url function now handles requests, ensure you use it in place of the previous request code block:

crawler.py
# pip3 install requests beautifulsoup4 tenacity
# ...
from tenacity import retry, stop_after_attempt, wait_exponential

# ...


# implement a request retry mechanism
@retry(
    stop=stop_after_attempt(4),  # maximum number of retries
    wait=wait_exponential(multiplier=5, min=4, max=5),  # exponential backoff
)
def fetch_url(url):
    response = requests.get(url)
    response.raise_for_status()
    return response


def crawler():
    # ...
   
    while (
        not high_priority_queue.empty() or not low_priority_queue.empty()
    ) and crawl_count < max_crawl:

        #    ...

        # request the target URL
        response = fetch_url(current_url)

        # ...

# ...

With the above modifications, your crawler will now retry failed requests automatically.

Maintain a Single Crawl Session

Sessions let you reuse connections, reducing redundant TCP handshakes and persisting headers and cookies across pages. This reduces network overhead and improves overall performance.

You can implement a session with the Session object from the Requests library. Create a new session instance and use it to update the fetch_url function like so:

crawler.py
# ...

# initialize session
session = requests.Session()

# ...
def fetch_url(url):
    response = session.get(url)
    # ...

Splendid! Your Python web crawler now manages HTTP requests with a session.
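
As an alternative, if you'd prefer to skip the extra dependency, the Requests session can retry failed requests on its own via urllib3's Retry and HTTPAdapter. Here's a minimal sketch; the retry count, backoff factor, and status codes are illustrative values:

Example
# optional: retries with exponential backoff using urllib3's Retry on the session
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=4,  # maximum number of retries
    backoff_factor=1,  # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry on these HTTP statuses
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

response = session.get("https://www.scrapingcourse.com/ecommerce/")
print(response.status_code)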

Combine the snippets from each section to see what the final code looks like:

crawler.py
# pip3 install requests beautifulsoup4 tenacity
import requests
from bs4 import BeautifulSoup
import re
import csv
from tenacity import retry, stop_after_attempt, wait_exponential
from queue import Queue

target_url = "https://www.scrapingcourse.com/ecommerce/"

# instantiate the queues
high_priority_queue = Queue()
low_priority_queue = Queue()

# create priority queues
high_priority_queue.put(target_url)
low_priority_queue.put(target_url)

# create a set to track visited URLs
visited_urls = set()

# set a maximum crawl limit
max_crawl = 20

# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")

# define a list to collect scraped data
product_data = []

# initialize session
session = requests.Session()

# implement a request retry
@retry(
    stop=stop_after_attempt(4),  # maximum number of retries
    wait=wait_exponential(multiplier=5, min=4, max=5),  # exponential backoff
)
def fetch_url(url):
    response = session.get(url)
    response.raise_for_status()
    return response

def crawler():
    # set a crawl counter to track the crawl depth
    crawl_count = 0
 
    while (
        not high_priority_queue.empty() or not low_priority_queue.empty()
    ) and crawl_count < max_crawl:

        # update the priority queue
        if not high_priority_queue.empty():
            current_url = high_priority_queue.get()
        elif not low_priority_queue.empty():
            current_url = low_priority_queue.get()
        else:
           break

        if current_url in visited_urls:
            continue

        # add the current URL to the URL set
        visited_urls.add(current_url)

        # request the target URL
        response = fetch_url(current_url)

        # parse the HTML
        soup = BeautifulSoup(response.content, "html.parser")

        # collect all the links
        for link_element in soup.find_all("a", href=True):
            url = link_element["href"]

            # convert links to absolute URLs
            if not url.startswith("http"):
                absolute_url = requests.compat.urljoin(target_url, url)
            else:
                absolute_url = url

            # ensure the crawled link belongs to the target domain and hasn't been visited
            if absolute_url.startswith(target_url) and absolute_url not in visited_urls:
                # prioritize product pages
                if url_pattern.search(absolute_url):
                    high_priority_queue.put(absolute_url)

                else:
                    low_priority_queue.put(absolute_url)

        # extract content only if the current URL matches the regex page pattern
        if url_pattern.search(current_url):
            # get the parent element
            product_containers = soup.find_all("li", class_="product")

            # scrape product data
            for product in product_containers:
                data = {
                    "Url": product.find("a", class_="woocommerce-LoopProduct-link")[
                        "href"
                    ],
                    "Image": product.find("img", class_="product-image")["src"],
                    "Name": product.find("h2", class_="product-name").get_text(),
                    "Price": product.find("span", class_="price").get_text(),
                }

                # append extracted data
                product_data.append(data)

        # update the crawl count
        crawl_count += 1

# execute the crawl
crawler()

# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
    writer.writeheader()
    writer.writerows(product_data)

Bravo! You just built a robust web crawler in Python. But the work doesn't stop here: you still need to deal with edge cases like advanced anti-bot measures, parallel crawling, JavaScript rendering, and more.

We'll explore a solution that can handle all these in the next section.

Avoid Getting Blocked While Crawling With Python

The biggest challenge of web crawling in Python is getting blocked. Many sites use anti-bot measures to identify and stop automated access, preventing you from getting your desired data.

To avoid getting blocked during web crawling, you need to apply mitigation measures. Some strategies include setting a custom User Agent, crawling during off-peak hours, or using proxies. However, these tips work best in simple scenarios and often fall short against advanced anti-bot systems.
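
For instance, here's how you might set a custom User Agent on the crawl session. The User-Agent string below is just a sample desktop Chrome value:

Example
# set a custom User-Agent header on the session (sample UA string)
import requests

session = requests.Session()
session.headers.update(
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
)

response = session.get("https://www.scrapingcourse.com/ecommerce/")
print(response.status_code)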

The easiest and most reliable way to bypass anti-bot measures is to use a web scraping API, such as the ZenRows Universal Scraper API.

ZenRows provides a seamless crawling experience with a single API call. It offers premium rotating proxies to prevent IP bans and cookie support for persisting connections across multiple requests. With support for JavaScript rendering and advanced anti-bot auto-bypass, you can simulate human interactions on the go and forget about getting blocked even while crawling the most protected websites.

Let's see how ZenRows works by scraping the Anti-bot Challenge page, a heavily protected website.

Sign up on ZenRows to open the Request Builder. Paste your target URL in the link box and activate Premium Proxies and JS Rendering.

building a scraper with zenrows

Then, select Python as your programming language and choose the API connection mode. Copy the generated Python code and paste it into your scraper.

The generated Python code should look like this:

crawler.py
# pip install requests
import requests

url = "https://www.scrapingcourse.com/antibot-challenge"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

The above code bypasses the anti-bot challenge and outputs the protected site's full-page HTML:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! 🎉 You just bypassed an anti-bot measure with the ZenRows Universal Scraper API.

Web Crawling Tools for Python

There are several helpful web crawling tools to make discovering links and visiting pages easier. Here's a list of some of the best Python web crawling tools that can assist you:

  1. ZenRows: A comprehensive scraping and crawling solution for bypassing CAPTCHAs and other anti-bots. It features rotating proxies, geo-localization, JavaScript rendering and advanced AI anti-bot auto-bypass.
  2. Scrapy: One of the most powerful Python crawling frameworks. It provides a high-level API for building scalable and efficient crawlers.
  3. Selenium: A popular browser automation library for web scraping and crawling. It executes JavaScript and interacts with web pages in a browser environment the way a human user would.

Another important aspect of web crawling you should pay attention to is best practices. You'll learn them in the next section.

Best Web Crawling Practices in Python and Considerations

The Python crawling best practices below will help you build a more robust script.

Crawling JavaScript-Rendered Web Pages in Python

Crawling pages that rely on JavaScript for rendering or data retrieval can be challenging because HTTP-based libraries like Requests and Beautiful Soup can't execute JavaScript or simulate dynamic interactions.

The easiest way to crawl dynamic websites in Python is to use headless browser automation tools like Selenium or Playwright. These libraries allow you to control web browsers programmatically and simulate user interactions like clicking, scrolling, and more. They even allow you to crawl React applications.
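
For example, here's a minimal Playwright sketch that loads a page in a headless browser and hands the rendered HTML to your parser. It assumes you've installed Playwright (pip3 install playwright, then playwright install chromium), and the target URL is an illustrative JavaScript-rendered demo page:

Example
# fetch a JavaScript-rendered page with Playwright's sync API
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.scrapingcourse.com/javascript-rendering")
    page.wait_for_load_state("networkidle")  # wait for dynamic content to load
    html = page.content()  # fully rendered HTML you can pass to Beautiful Soup
    browser.close()

print(html[:500])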

Parallel Scraping in Python With Concurrency

So far, the crawler has handled one page at a time. After sending an HTTP request, it sits idle, waiting for the server's response before proceeding. This approach is inefficient and increases crawling time.

To improve performance, introduce multi-threading to process multiple pages concurrently. Each thread fetches and parses a page independently, significantly speeding up the crawl. For thread safety, ensure that only one thread at a time updates shared state, such as the visited_urls set, using synchronization primitives like locks.

Fortunately, Python's Queue is thread-safe, meaning multiple threads can read from and write to it without causing race conditions. A race condition occurs when threads try to modify shared data simultaneously. Since Queue handles synchronization internally, if one thread is accessing a resource, others must wait their turn, ensuring data integrity and preventing conflicts.

Add the threading module to your import and use it to create a thread lock to track and update the visited_urls set. Then, define a worker to run multiple threads. The final crawler code looks like the following after making these changes:

crawler.py
# pip3 install requests beautifulsoup4 tenacity
import requests
from bs4 import BeautifulSoup
import re
import csv
from queue import Queue
from tenacity import retry, stop_after_attempt, wait_exponential
import threading

target_url = "https://www.scrapingcourse.com/ecommerce/"

# instantiate the queues
high_priority_queue = Queue()
low_priority_queue = Queue()

# add the initial URL to the queues
high_priority_queue.put(target_url)
low_priority_queue.put(target_url)

# define a thread-safe set for visited URLs
visited_urls = set()
visited_lock = threading.Lock()

# set maximum crawl limit
max_crawl = 20

# create a regex pattern for product page URLs
url_pattern = re.compile(r"/page/\d+/")

# list to store scraped data
product_data = []

# activate thread lock to prevent race conditions
data_lock = threading.Lock()

# initialize session
session = requests.Session()


# implement a request retry
@retry(
    stop=stop_after_attempt(4),  # max retries
    wait=wait_exponential(multiplier=5, min=4, max=5),  # exponential backoff
)
def fetch_url(url):
    response = session.get(url)
    response.raise_for_status()
    return response


def crawler():
    # set a crawl counter to track the crawl depth
    crawl_count = 0

    while (
        not high_priority_queue.empty() or not low_priority_queue.empty()
    ) and crawl_count < max_crawl:

        # update the priority queue
        if not high_priority_queue.empty():
            current_url = high_priority_queue.get()
        elif not low_priority_queue.empty():
            current_url = low_priority_queue.get()
        else:
            break

        with visited_lock:
            if current_url in visited_urls:
                continue

            # add the current URL to the URL set
            visited_urls.add(current_url)

        # request the target URL
        response = fetch_url(current_url)

        # parse the HTML
        soup = BeautifulSoup(response.content, "html.parser")

        # collect all the links
        for link_element in soup.find_all("a", href=True):
            url = link_element["href"]

            # check if the URL is absolute or relative
            if not url.startswith("http"):
                absolute_url = requests.compat.urljoin(target_url, url)
            else:
                absolute_url = url

            with visited_lock:
                # ensure the crawled link belongs to the target domain and hasn't been visited
                if (
                    absolute_url.startswith(target_url)
                    and absolute_url not in visited_urls
                ):
                    # prioritize product pages 
                    if url_pattern.search(absolute_url):
                        high_priority_queue.put(absolute_url)
                    else:
                        low_priority_queue.put(absolute_url)

        # extract content only if the current URL matches the regex page pattern
        if url_pattern.search(current_url):
            # get the parent element
            product_containers = soup.find_all("li", class_="product")

            # scrape product data
            for product in product_containers:

                data = {
                    "Url": product.find("a", class_="woocommerce-LoopProduct-link")[
                        "href"
                    ],
                    "Image": product.find("img", class_="product-image")["src"],
                    "Name": product.find("h2", class_="product-name").get_text(),
                    "Price": product.find("span", class_="price").get_text(),
                }
                with data_lock:
                    # append extracted data
                    product_data.append(data)

        # update the crawl count
        crawl_count += 1


# specify the number of threads to use
num_workers = 4
threads = []

# start worker threads
for _ in range(num_workers):
    thread = threading.Thread(target=crawler, daemon=True)
    threads.append(thread)
    thread.start()

# wait for all threads to finish
for thread in threads:
    thread.join()

# save data to CSV
csv_filename = "products.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
    writer.writeheader()
    writer.writerows(product_data)

Take a look at the benchmarks below to verify the performance results:

  • Sequential requests: 29.32s.
  • Queue with one worker (num_workers = 1): 29.41s.
  • Queue with two workers (num_workers = 2): 20.05s.
  • Queue with five workers (num_workers = 5): 11.97s.
  • Queue with ten workers (num_workers = 10): 12.02s.

There's almost no difference between sequential requests and a single worker, but adding more workers quickly pays off. Find out why in our tutorial on web scraping with concurrency in Python.

Distributed Web Scraping in Python

Distributed crawling involves spreading the crawling job across several servers. While this technique can help you achieve maximum scalability and improve performance, it's a complex process.

Fortunately, you can achieve distributed web crawling in Python using libraries like Celery or Redis Queue.

To learn more, check out our tutorial on distributed web crawling.
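
As a minimal illustration of the idea, here's a hedged Celery sketch that turns page fetching into a task you can run on many workers. It assumes a Redis broker on localhost and requires pip3 install "celery[redis]":

Example
# distribute crawl jobs across workers with Celery (assumes a local Redis broker)
import requests
from celery import Celery

app = Celery("crawler", broker="redis://localhost:6379/0")

@app.task
def crawl_page(url):
    # each worker fetches and processes one page independently
    response = requests.get(url)
    response.raise_for_status()
    return response.text

# enqueue pages from any machine that can reach the broker:
# crawl_page.delay("https://www.scrapingcourse.com/ecommerce/")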

Persistency

Data persistence is helpful for scalability and maintenance. In a real-world crawling scenario, you should store:

  • The discovered URLs with their timestamps.
  • Status codes with timestamps to reschedule failed requests.
  • The scraped content.
  • The HTML documents for later processing.

You should export this information to files and/or store it in a database.
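
As a minimal illustration, you could persist crawl metadata in SQLite. The table layout and column names below are assumptions for demonstration purposes:

Example
# persist crawl metadata in SQLite (illustrative schema)
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("crawl_state.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS crawled_pages (
        url TEXT PRIMARY KEY,
        status_code INTEGER,
        crawled_at TEXT,
        html TEXT
    )
    """
)

def save_page(url, status_code, html):
    # store the URL, its status code, a timestamp, and the raw HTML
    conn.execute(
        "INSERT OR REPLACE INTO crawled_pages VALUES (?, ?, ?, ?)",
        (url, status_code, datetime.now(timezone.utc).isoformat(), html),
    )
    conn.commit()

save_page("https://www.scrapingcourse.com/ecommerce/", 200, "<html>...</html>")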

Canonicals to Avoid Duplicate URLs

The current crawling process doesn't account for canonical links, which website owners often use to indicate that multiple URLs represent the same page.

To avoid redundant crawling, check whether the page declares a canonical URL in its <link rel="canonical"> tag. If it does, treat that as the official URL and add it to the visited set instead of the original request URL. This prevents duplicate processing when the same page is reachable from different links.

Additionally, a single page can have multiple URLs due to query parameters (e.g., ?ref=123) or hash fragments (e.g., #section1). Without normalization, the crawler may treat these variations as separate pages and crawl them multiple times. While this tutorial only resolved relative links with urljoin, you can also normalize URLs, for example with w3lib's url_query_cleaner or the standard library's urllib.parse, to strip unnecessary parameters and standardize URLs before checking for duplicates.
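
Here's a small, illustrative sketch of both ideas: preferring the declared canonical URL when one exists, and normalizing URLs with the standard library before deduplication. The helper names are assumptions:

Example
# canonical detection and URL normalization (illustrative helpers)
from urllib.parse import urlsplit, urlunsplit
from bs4 import BeautifulSoup

def canonical_or_self(html, request_url):
    # prefer the page's declared canonical URL when present
    soup = BeautifulSoup(html, "html.parser")
    canonical = soup.select_one('link[rel="canonical"]')
    return canonical["href"] if canonical and canonical.get("href") else request_url

def normalize_url(url):
    # drop query strings and hash fragments so URL variants map to one entry
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

html = '<html><head><link rel="canonical" href="https://www.scrapingcourse.com/ecommerce/"></head></html>'
print(canonical_or_self(html, "https://www.scrapingcourse.com/ecommerce/?ref=123"))
print(normalize_url("https://www.scrapingcourse.com/ecommerce/?ref=123#section1"))
# both print: https://www.scrapingcourse.com/ecommerce/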

Conclusion

In this guide, you learned the fundamentals of web crawling. You started with the basics and then moved on to more advanced topics to become a Python crawling expert!

Now you know:

  • What a web crawler is.
  • How to build a crawler in Python.
  • What to do to make your script production-ready.
  • The best Python crawling libraries.
  • The Python web crawling best practices and advanced techniques.

Regardless of how robust your crawler is, anti-bot measures can detect and block it. You can eliminate any challenge with ZenRows, an all-in-one web scraping solution for crawling at scale without limitations.

Try ZenRows for free!

Frequent Questions

How Do I Create a Web Crawler in Python?

To create a web crawler in Python, first define the initial URL and maintain a set of visited URLs. You can then use libraries such as Requests or Scrapy to send HTTP requests and retrieve HTML content. The crawler extracts relevant information from the HTML. Finally, repeat the process by following the links discovered on the pages.

Can Python Be Used for a Web Crawler?

Yes, Python is widely used for web crawling thanks to its rich ecosystem of libraries and tools. It offers libraries like Requests, Beautiful Soup, and Scrapy, simplifying data extraction at scale.

What Is a Web Crawler Used for?

You can use a web crawler to browse and collect information from sites systematically. They automate fetching web pages and following links to discover new web content. Crawlers are popular for web indexing, content aggregation, and URL discovery.

How Do You Crawl Data from a Website in Python?

To crawl data from a website in Python, send an HTTP request to the desired URL using a library like Requests. Retrieve the HTML response, parse it with a library such as Beautiful Soup, and apply extraction techniques like CSS selectors or XPath expressions to find and extract the desired elements.

What Are the Different Ways to Crawl Web Data in Python?

There are several ways to crawl web data in Python, depending on the target site's complexity and the project's size. Use libraries like Requests and Beautiful Soup for basic crawling and scraping tasks. You might prefer complete crawling frameworks like Scrapy for more advanced functionality and flexibility. Try ZenRows with a single API call to significantly reduce the overall complexity and bypass all anti-bot measures.
