To scrape at scale, we need not only to extract the data but also to keep a continuous flow of new URLs. You can get those from the scraping process itself or from a previous one.

Let’s start with the case where we have only one URL. The first step is to call the API with that URL and extract the links. To simplify and avoid problems, we’ll keep only links that start with a slash (internal links).

For the code to work, you will need Python 3 installed. Some systems have it pre-installed. After that, install the necessary libraries by running pip install.

pip install requests beautifulsoup4

We will use BeautifulSoup for extracting those links, but it is not required for the API to work. We’ll also create functions to separate concerns.

import requests
from bs4 import BeautifulSoup

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"
seed_url = "" # ... your seed URL here

def extract_links(soup):
	return [a.get("href")
			for a in soup.find_all("a")
			if a.get("href") and a.get("href").startswith("/")]

def call_url(url):
	response = requests.get(zenrows_api_base, params={"url": url})
	soup = BeautifulSoup(response.text, "html.parser")
	links = extract_links(soup)
	print(links)

call_url(seed_url)
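
Note that those links are relative paths, not full URLs. Before feeding them back to the API for crawling, they have to be turned into absolute URLs; urljoin from Python’s standard library handles that. A minimal sketch, using a made-up domain and path:

from urllib.parse import urljoin

base_url = "https://example.com/category/shoes"  # hypothetical page the links came from
relative_link = "/category/boots"  # an internal link extracted from that page

print(urljoin(base_url, relative_link))  # https://example.com/category/boots

The final snippet below applies this conversion before adding links to the queue.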

Before we continue calling the API with new URLs, there are a couple of safeguards we need: a maximum number of requests and a set of visited URLs. These will keep the script from going into infinite loops, making thousands of calls without generating any real value.

max_visits = 10
visited = set()
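
As a reference, these two guards combine into a simple check before every request; should_crawl below is just a hypothetical helper name, and the final snippet applies the same condition directly inside the worker:

def should_crawl(url):
	# True only while under the visit limit and for URLs we have not processed yet
	return len(visited) < max_visits and url not in visited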

The next critical piece is a queue. Once we process the seed URL, we’ll have a list of links to visit. Thanks to Python’s queue and threading modules, we can forget about handling the internals and just set a maximum number of workers and add URLs to the queue.

The code needed for the queue to work is not long.

import queue
from threading import Thread

num_workers = 5

def queue_worker(i, q):
	while True:
		url = q.get() # Get an item from the queue, blocks until one is available
		# Some processing, to be defined
		q.task_done() # Notifies the queue that the item has been processed

q = queue.Queue()
for i in range(num_workers):
	Thread(target=queue_worker, args=(i, q), daemon=True).start()

q.put(seed_url)
q.join() # Blocks until all items in the queue are processed and marked as done

Now, the final step is to make those two parts work together: crawling the target site and the queue. The result is a longer and more complicated snippet; try to understand it before going to production and scraping at scale. Follow the comments for the parts you need to fill in with custom logic.

For simplicity, there is no error control, nor is the data stored anywhere. But the concerns are separated, so adding them should not be complicated (see the sketch after the full snippet). Be careful when trying to crawl huge sites like Amazon without a proper maximum: the queue can grow to thousands of pages in a minute.

import requests
from bs4 import BeautifulSoup
import queue
from threading import Thread
from urllib.parse import urljoin

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"
seed_url = ""  # ... your seed URL here
max_visits = 10
num_workers = 5

visited = set()
data = []

def extract_links(soup):
	# get the links you want to follow here
	return [a.get("href")
			for a in soup.find_all("a")
			if a.get("href") and a.get("href").startswith("/")]

def extract_content(url, soup):
	# extract the content you want here
	data.append({
		"url": url,
		"title": soup.title.string,
		"h1": soup.find("h1").text,
	})

def crawl(url):
	visited.add(url)
	print("Crawl: ", url)
	response = requests.get(zenrows_api_base, params={"url": url})
	soup = BeautifulSoup(response.text, "html.parser")
	extract_content(url, soup)
	links = extract_links(soup)
	for link in links:
		absolute_link = urljoin(url, link)  # links are relative paths; the API needs absolute URLs
		if absolute_link not in visited:
			q.put(absolute_link)

def queue_worker(i, q):
	while True:
		url = q.get()
		if len(visited) < max_visits and url not in visited:
			crawl(url)
		q.task_done()

q = queue.Queue()
for i in range(num_workers):
	Thread(target=queue_worker, args=(i, q), daemon=True).start()

q.put(seed_url)
q.join()

print("Visited:", visited)
print("Data:", data)

We have a complete series on scraping with Python; take a look if you are interested. Contact us with any questions you might still have.