Scrape from a List of URLs

To do some serious scraping, we need not only to extract the data but also to have URLs to scrape. In this example, we assume that you already have a list of URLs ready to go. If that's not your case and you only have a seed URL, we cover that scenario in another entry.

There are two main ways to approach the problem: processing the URLs sequentially or in parallel. We'll start with the sequential version.

For the code to work, you will need Python 3 installed. Some systems have it pre-installed. After that, install the necessary libraries with pip:

pip install requests beautifulsoup4

Sequential

Almost any loop will result in sequential calls to the API unless you use async code (as JavaScript does by default). In Python, for example, a simple for loop is enough as a first approach.

import requests 
 
zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_KEY" 
urls = [ 
	# ... your URLs here 
] 
 
for url in urls: 
	response = requests.get(zenrows_api_base, params={"url": url}) 
	print(response.text)

Printing the result is not particularly helpful, but you get the point. Now let's process some content. For that, we will use BeautifulSoup to extract the page's title. Take it as a placeholder for whatever extraction logic you need.

from bs4 import BeautifulSoup 
# ... 
def extract_content(soup): 
	print(soup.title.string) 
 
for url in urls: 
	response = requests.get(zenrows_api_base, params={"url": url}) 
	soup = BeautifulSoup(response.text, "html.parser") 
	extract_content(soup)

Inside extract_content, you can pull whatever you need from the page's content. You could create objects with the extracted data and accumulate them for later processing. The drawback is that data might be lost if the script crashes partway through. The snippets omit error handling for simplicity, but keep it in mind when writing production code. Here's an example of a simple extractor.

def extract_content(url, soup): 
	return { 
		"url": url, 
		"title": soup.title.string, 
		"h1": soup.find("h1").text, 
	}
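As a sketch of more defensive extraction (still using BeautifulSoup, as above), you can guard against missing elements so that a page without a title or h1 doesn't raise an exception and kill the whole run:

```python
from bs4 import BeautifulSoup

def extract_content(url, soup):
	# Guard against pages missing <title> or <h1> so one bad
	# page doesn't raise AttributeError and stop the loop.
	h1 = soup.find("h1")
	return {
		"url": url,
		"title": soup.title.string if soup.title else None,
		"h1": h1.text if h1 else None,
	}

html = "<html><head><title>Example</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(extract_content("https://example.com", soup))
```

With this version, a missing element shows up as None in the result instead of crashing the process.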

Everything put together:

import requests 
from bs4 import BeautifulSoup 
 
zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_KEY" 
urls = [ 
	# ... your URLs here 
] 
 
def extract_content(url, soup): 
	return { 
		"url": url, 
		"title": soup.title.string, 
		"h1": soup.find("h1").text, 
	} 
 
results = [] 
for url in urls: 
	response = requests.get(zenrows_api_base, params={"url": url}) 
	soup = BeautifulSoup(response.text, "html.parser") 
	results.append(extract_content(url, soup)) 
 
print(results)
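Since accumulated results can be lost if the process dies, one option is to persist them instead of just printing. A minimal sketch using the standard csv module (the fieldnames match the dictionary returned by extract_content above; the filename is an arbitrary choice):

```python
import csv

def save_results(results, path):
	# One row per scraped page; keys must match extract_content's output.
	with open(path, "w", newline="") as f:
		writer = csv.DictWriter(f, fieldnames=["url", "title", "h1"])
		writer.writeheader()
		writer.writerows(results)

save_results(
	[{"url": "https://example.com", "title": "Example", "h1": "Hello"}],
	"results.csv",
)
```

For extra safety, you could open the file before the loop and write each row as it arrives, so a crash loses at most the in-flight page.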

Parallel

The code above works fine for a small set of URLs. But as the list grows, we want to parallelize as much as possible: sending two requests simultaneously will (more or less) cut the total time in half.

So why not send all the requests at the same time? Systems usually enforce a maximum number of concurrent requests. ZenRows has a maximum depending on the plan you are on. Let's take five concurrent requests for this test.
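To illustrate the idea of capping concurrency, here's a hedged sketch in plain asyncio (not the SDK): a semaphore limits how many coroutines are in flight at once, and fetch is a stand-in for the real HTTP call.

```python
import asyncio

MAX_CONCURRENCY = 5  # e.g. your plan's concurrency limit

async def fetch(url):
	# Stand-in for the real HTTP request.
	await asyncio.sleep(0.01)
	return url

async def bounded_fetch(semaphore, url):
	# At most MAX_CONCURRENCY coroutines pass this point at a time.
	async with semaphore:
		return await fetch(url)

async def main(urls):
	semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
	return await asyncio.gather(*[bounded_fetch(semaphore, u) for u in urls])

urls = [f"https://example.com/{i}" for i in range(25)]
results = asyncio.run(main(urls))
```

This is essentially what a concurrency-aware client does for you under the hood.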

Unlike the sequential version, which is fairly general, the code for parallel requests depends heavily on the language in use. We will use Python since it is common, together with the ZenRows Python SDK, which comes with concurrency out of the box. Feel free to ask us about versions in other languages.

pip install zenrows

The first thing we need to introduce is the asyncio library. Its asyncio.gather will wait for all the calls to finish. Then, we can request URLs with client.get_async.

Internally, the SDK creates a pool with a maximum number of workers. It handles each worker's availability, ensuring that tasks get processed as soon as possible while never exceeding the configured maximum.

from zenrows import ZenRowsClient 
import asyncio 
from bs4 import BeautifulSoup 
 
client = ZenRowsClient("YOUR_KEY", concurrency=5, retries=1) 
 
urls = [ 
	# ... 
] 
 
async def call_url(url): 
	try: 
		response = await client.get_async(url) 
		if response.ok:
			soup = BeautifulSoup(response.text, "html.parser") 
			return { 
				"url": url, 
				"title": soup.title.string, 
				"h1": soup.find("h1").text, 
			} 
	except Exception:
		pass  # ignoring errors for brevity; log or retry in production
 
async def main(): 
	results = await asyncio.gather(*[call_url(url) for url in urls]) 
	print(results) 
 
asyncio.run(main())

We didn't implement proper error handling for clarity's sake. If a request fails or an exception is raised, the result for that URL will be None.
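A simple follow-up step, then, is to filter out the failed entries before processing the results:

```python
# Example output from the gather above: one dict per success, None per failure.
results = [
	{"url": "https://example.com", "title": "Example", "h1": "Hello"},
	None,  # a URL that failed or raised an exception
]

# Keep only the entries where call_url returned data.
successful = [r for r in results if r is not None]
```

You could also collect the failed URLs the same way and retry them in a second pass.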

We have another article on how to handle concurrency, with examples in JavaScript.