Python Requests and BeautifulSoup Integration

Learn how to integrate the ZenRows API with Python Requests and BeautifulSoup to extract the data you want, from basic calls to advanced features such as auto-retry and concurrency. We will walk through each stage of the process, from installation to final code, explaining everything as we go.

For a short version, skip to the final code and copy it. It is commented with the parts that must be completed and helpful suggestions for the more challenging details.

For the code to work, you will need Python 3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install requests beautifulsoup4

You will also need to register to get your API Key.

Using Requests to Get a Page

The first library we will look at is Requests, an HTTP library for Python. It exposes a get method that calls a URL and returns its HTML. For now, we won't use any parameters; this is just a demo to see how it works.

Careful! This script runs without any proxy, so the server will see your actual IP. You don't need to run this snippet.

import requests

url = "" # ... your URL here
response = requests.get(url)

print(response.text)  # page's HTML

Calling ZenRows API with Requests

Connecting Requests to the ZenRows API is straightforward. get's target will be the API base, followed by two params: apikey for authentication and url. URLs must be encoded; however, requests handles that for us when using params.

With this simple update, we solve most scraping problems, such as proxy rotation, setting the correct headers, and avoiding CAPTCHAs and anti-bot solutions. But there are a few remaining issues that we will address now. Keep reading.

import requests

url = "" # ... your URL here
apikey = "YOUR_KEY" # paste your API Key here
zenrows_api_base = "https://api.zenrows.com/v1/"

response = requests.get(zenrows_api_base, params={
	"apikey": apikey,
	"url": url,
})

print(response.text)  # page's HTML
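To see what that encoding looks like, here is a short sketch using the standard library's urlencode, which applies the same percent-encoding that requests performs internally when given params (the target URL below is just an example):

```python
from urllib.parse import urlencode

# hypothetical target URL containing characters that need escaping
params = {"apikey": "YOUR_KEY", "url": "https://example.com/?q=a b"}
query = urlencode(params)
print(query)  # the target URL is percent-encoded inside the query string
```

Note how "://" becomes "%3A%2F%2F"; without this step, the API would not be able to tell the target URL's query string apart from its own parameters.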

Extracting Basic Data with BeautifulSoup

We'll now use BeautifulSoup to parse the page's HTML and extract some data. We will write a simple function called extract_content that returns the URL, title, and h1 content. That is where you can put your custom extraction logic.

import requests
from bs4 import BeautifulSoup

url = "" # ... your URL here
apikey = "YOUR_KEY" # paste your API Key here
zenrows_api_base = "https://api.zenrows.com/v1/"

def extract_content(url, soup):
	# extracting logic goes here
	return {
		"url": url,
		"title": soup.title.string,
		"h1": soup.find("h1").text,
	}

response = requests.get(zenrows_api_base, params={
	"apikey": apikey,
	"url": url,
})
soup = BeautifulSoup(response.text, "html.parser")
content = extract_content(url, soup)

print(content)  # custom scraped content
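extract_content as written assumes the page has both a title and an h1 and will raise an AttributeError otherwise. A slightly more defensive sketch (same name and signature, with a fallback to None) could look like this:

```python
from bs4 import BeautifulSoup

def extract_content(url, soup):
	# return None for missing elements instead of raising AttributeError
	h1 = soup.find("h1")
	return {
		"url": url,
		"title": soup.title.string if soup.title else None,
		"h1": h1.get_text(strip=True) if h1 else None,
	}

# sample page with a title but no h1, to show the fallback
html = "<html><head><title>Demo</title></head><body><p>no h1 here</p></body></html>"
content = extract_content("https://example.com", BeautifulSoup(html, "html.parser"))
print(content)
```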

List of URLs with Concurrency

Up until now, we were scraping a single URL. We will now introduce a list of URLs, which is more relevant to a real-world use case. In addition, we will set up concurrency so we don't have to wait for a sequential process to complete. It allows the script to process multiple URLs simultaneously, always up to a maximum. That number is determined by the plan you are on.

In short, the multiprocessing package implements a ThreadPool that will queue and execute all our requests. It handles the parallelism for us, keeping the number of simultaneous requests at or below the limit (10 in the example). Once all the requests finish, it groups the results in a single variable, and we print them. In a real case, you would, for example, store them in a database.

Note that this is not a queue; we cannot add new URLs once the process starts. If that is your use case, check out our guide on how to Scrape and Crawl from a Seed URL.

import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

apikey = "YOUR_KEY" # paste your API Key here
zenrows_api_base = "https://api.zenrows.com/v1/"

concurrency = 10
urls = [
	# ... your URLs here
]

def extract_content(url, soup):
	# extracting logic goes here
	return {
		"url": url,
		"title": soup.title.string,
		"h1": soup.find("h1").text,
	}

def scrape_with_zenrows(url):
	response = requests.get(zenrows_api_base, params={
		"apikey": apikey,
		"url": url,
	})
	soup = BeautifulSoup(response.text, "html.parser")
	return extract_content(url, soup)

pool = ThreadPool(concurrency)
results = pool.map(scrape_with_zenrows, urls)
pool.close()
pool.join()

for result in results:
	print(result)  # custom scraped content
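To see the pooling behavior without hitting the API, here is a toy sketch where a stand-in function replaces scrape_with_zenrows; the URLs are placeholders:

```python
from multiprocessing.pool import ThreadPool

def fake_scrape(url):
	# stand-in for scrape_with_zenrows; no network involved
	return {"url": url, "title": "title for " + url}

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
with ThreadPool(2) as pool:  # at most 2 workers run at the same time
	results = pool.map(fake_scrape, urls)  # map preserves input order

for result in results:
	print(result)
```

Note that map returns the results in the same order as the input list, regardless of which thread finished first.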

Auto-Retry Failed Requests

The final step in creating a robust scraper is to retry failed requests. We will be using Retry from urllib3 and HTTPAdapter from requests.

The basic idea is as follows:
  1. Using the return status code, identify the failed requests.
  2. Wait an arbitrary amount of time. In our example, it will grow exponentially between tries.
  3. Retry the request until it succeeds or reaches a maximum number of retries.
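The waits in step 2 follow an exponential schedule. Here is a simplified sketch of the formula urllib3 applies (the exact behavior, such as whether the very first retry sleeps, varies slightly between urllib3 versions):

```python
def backoff_delay(backoff_factor, attempt):
	# simplified exponential backoff: with factor 1, waits grow 1s, 2s, 4s, ...
	return backoff_factor * (2 ** (attempt - 1))

for attempt in range(1, 4):
	print("retry", attempt, "-> wait", backoff_delay(1, attempt), "s")
```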

Fortunately, we can use these two libraries to implement that behavior. We must first configure Retry and then mount an HTTPAdapter on a requests Session. Unlike in the previous examples, we won't call requests.get directly but requests_session.get. Once the session is created, it will use the same adapter for all subsequent calls.

For more information, visit the article on Retry Failed Requests.

import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

apikey = "YOUR_KEY" # paste your API Key here
zenrows_api_base = "https://api.zenrows.com/v1/"
urls = [
	# ... your URLs here
]
concurrency = 10  # maximum concurrent requests, depends on the plan

requests_session = requests.Session()
retries = Retry(
	total=3,  # number of retries
	backoff_factor=1,  # exponential time factor between attempts
	status_forcelist=[429, 500, 502, 503, 504]  # status codes that will retry
)

requests_session.mount("http://", HTTPAdapter(max_retries=retries))
requests_session.mount("https://", HTTPAdapter(max_retries=retries))

def extract_content(url, soup):
	# extracting logic goes here
	return {
		"url": url,
		"title": soup.title.string,
		"h1": soup.find("h1").text,
	}

def scrape_with_zenrows(url):
	try:
		response = requests_session.get(zenrows_api_base, params={
			"apikey": apikey,
			"url": url,
		})

		soup = BeautifulSoup(response.text, "html.parser")
		return extract_content(url, soup)
	except Exception as e:
		print(e)  # will print "Max retries exceeded"

pool = ThreadPool(concurrency)
results = pool.map(scrape_with_zenrows, urls)
pool.close()
pool.join()

for result in results:
	if result:
		print(result)  # custom scraped content

If you have any problems with the implementation or it does not work for your use case, contact us and we'll help you.