Are you scraping with Python and looking for a full-featured library that gives you everything you need to build a scraper in a single installation? Hrequests may be exactly what you're after.
In this tutorial, you'll learn how Hrequests works and use its basic and advanced features to extract content from static and dynamic websites.
What Is Hrequests?
Hrequests, a short form of "human requests", is a web scraping library for extracting data in Python. It combines an HTTP client, an HTML parser, and a complete headless browser interface, making it suitable for scraping static and dynamic websites.
If you're extracting data from a static website and want to avoid browser instance overhead, you can selectively apply the HTTP client and parser only. The headless browser function comes in handy when you need to automate user actions like scrolling, clicking, dragging and dropping, hovering, and more.
What's more, Hrequests has a mock option to mimic human behavior during automation, boosting your scraper's anti-bot evasion capabilities. If you need extra scraping power, such as solving CAPTCHAs, all you need to do is add a CAPTCHA solver extension to your request with the extension option.
The library supports concurrency for scraping multiple pages simultaneously. It's also thread-safe out of the box, with no need to pair it with asyncio for thread safety.
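Here's a minimal sketch of both modes side by side, using only the calls demonstrated later in this tutorial: the plain HTTP client for static pages, a URL list for concurrent requests, and a browser session for JavaScript-heavy pages.
# a quick preview of both modes (each is covered step by step below)
import hrequests

# HTTP client + parser: enough for static pages
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")
print(response.status_code)

# pass a list of URLs to send the requests concurrently
responses = hrequests.get([
    "https://www.scrapingcourse.com/ecommerce/page/1/",
    "https://www.scrapingcourse.com/ecommerce/page/2/",
])

# headless browser session: for dynamic, JavaScript-rendered content
session = hrequests.Session(browser="chrome")
page = session.render("https://www.scrapingcourse.com/infinite-scrolling")
page.close()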
Now, let's build a scraper to see these features in action!
How to Build Your First Scraper With Hrequests
To explore the scraping features of Hrequests, you'll scrape product information from Scraping Course, a demo e-commerce website. You'll start by obtaining its full-page HTML before scraping specific elements and exporting the extracted data to a CSV file.
Prerequisites
Hrequests supports Python 3+. So, feel free to download and install the latest Python version. Once Python is up and running, install the Hrequests library with pip by running the following command:
pip install -U hrequests[all]
This installs Hrequests and other supporting libraries, such as Playwright.
Hrequests uses Playwright's browsers under the hood. Install them with the following command to get full-fledged browser support:
playwright install
Open a new project folder with a code editor such as VS Code. Then, create a new scraper.py file in that folder.
You're now ready to start scraping with Hrequests in Python!
Step 1: Make Your First Request to Get HTML
The first step is to obtain the target website's HTML. Since the target website is static, you only need the Hrequests HTTP client, which has a built-in HTML parser.
To get the website's HTML, import Hrequests into your Python file and request the target web page. Obtain its HTML content from the response object and output it in your console:
# import the required library
import hrequests
# send a request to the target website
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")
# extract its HTML content
html_content = response.text
# output the HTML content
print(html_content)
Run the above code, and you'll get the following full-page HTML output:
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->
    </ul>
</body>
</html>
You can now access the website's HTML. Let's extract specific product details in the next section.
Step 2: Extract Product Data
Now, extend the previous code by scraping product names, prices, URLs, and image sources from the target website.
First, inspect the website's elements to expose its HTML layout. Open it with a browser like Chrome, right-click the first product, and select Inspect. Observe the elements, and you'll see that the products are inside individual list item (li) tags.
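For reference, each product card's markup looks roughly like the simplified snippet below. The class names are the ones the upcoming selectors target; the exact tag structure may vary slightly.
<li class="product ...">
    <a href="https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/">
        <img src=".../mh09-blue_main-324x324.jpg" alt="Abominable Hoodie">
        <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
        <span class="price">$69.00</span>
    </a>
</li>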
Modify the previous code to obtain all the product containers (li). Specify an empty array to collect the extracted data. Loop through the containers to scrape the target product information.
The Hrequests parser doesn't strip newline characters by default. Use the replace method to remove them from the extracted price data, just like in the code below.
# ...
# obtain all the product containers
products = response.html.find_all(".product")
# specify an empty data array to collect the extracted products
product_data = []
# loop through the product container array to extract specific product data
for product in products:
    # collect the extracted data into a dictionary
    data = {
        "name": product.find(".woocommerce-loop-product__title").text,
        # remove newline characters
        "price": product.find(".price").text.replace("\n", ""),
        "url": product.find("a").href,
        "img": product.find("img").src
    }
Append the scraped data to the empty list and output it:
    # ...
    # append the extracted data to the product_data array
    product_data.append(data)

# output the extracted data
print(product_data)
Merge all the snippets. You'll get the following complete code:
# import the required library
import hrequests

# send a request to the target website
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")

# obtain all the product containers
products = response.html.find_all(".product")

# specify an empty data array to collect the extracted products
product_data = []

# loop through the product container array to extract specific product data
for product in products:
    # collect the extracted data into a dictionary
    data = {
        "name": product.find(".woocommerce-loop-product__title").text,
        # remove newline characters
        "price": product.find(".price").text.replace("\n", ""),
        "url": product.find("a").href,
        "img": product.find("img").src
    }

    # append the extracted data to the product_data array
    product_data.append(data)

# output the extracted data
print(product_data)
And here's the output:
[
    {
        'name': 'Abominable Hoodie',
        'price': '$69.00',
        'url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/',
        'img': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
    },
    # ... other products omitted for brevity
    {
        'name': 'Artemis Running Short',
        'price': '$45.00',
        'url': 'https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/',
        'img': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg'
    }
]
Your scraper works! Let's export the data to a CSV file.
Step 3: Export as a CSV File
The next step is to export the scraped data to a CSV file to complete your scraping task.
Add Python's built-in CSV package to your imported libraries. Then, replace the previous print function with the following code that writes the data to a product_data.csv file:
# import the required libraries
# ...
import csv

# ...

# save the data to a CSV file
keys = product_data[0].keys()
with open("product_data.csv", "w", newline="", encoding="utf-8") as output_file:
    dict_writer = csv.DictWriter(output_file, fieldnames=keys)
    dict_writer.writeheader()
    dict_writer.writerows(product_data)

print("CSV created successfully")
Update the previous scraper with the above snippet, and you'll get this final code:
# import the required libraries
import hrequests
import csv

# send a request to the target website
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")

# obtain all the product containers
products = response.html.find_all(".product")

# specify an empty data array to collect the extracted products
product_data = []

# loop through the product container array to extract specific product data
for product in products:
    # collect the extracted data into a dictionary
    data = {
        "name": product.find(".woocommerce-loop-product__title").text,
        # remove newline characters
        "price": product.find(".price").text.replace("\n", ""),
        "url": product.find("a").href,
        "img": product.find("img").src
    }

    # append the extracted data to the product_data array
    product_data.append(data)

# save the data to a CSV file
keys = product_data[0].keys()
with open("product_data.csv", "w", newline="", encoding="utf-8") as output_file:
    dict_writer = csv.DictWriter(output_file, fieldnames=keys)
    dict_writer.writeheader()
    dict_writer.writerows(product_data)

print("CSV created successfully")
The above code writes the extracted product information to a product_data.csv file in your project folder.
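Open product_data.csv, and the header and first row should look something like this (values taken from the output in the previous step):
name,price,url,img
Abominable Hoodie,$69.00,https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/,https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg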
You now know how to scrape data with Hrequests. Good job!
However, the current scraper is basic and won't work for websites with dynamic content. Let's learn Hrequests' advanced concepts, which are essential for scaling your project.
Advanced Web Scraping With Hrequests
In this section, you'll learn to use Hrequests' advanced features, including scraping multiple pages, handling dynamic content, and scraping concurrently. Let's start with multiple-page scraping.
Scrape Multiple Pages
Your previous web scraper only extracts data from the first product page. However, the target website (Scraping Course) uses pagination to distribute products across multiple pages. You'll need to scale your scraper to get all the content on that website.
Inspecting the website will reveal it only has 12 pages. Clicking the "Next" button takes you to the next page. The button disappears when you get to the last page.
To scrape all 12 pages, your scraper will follow each page until the next-page button no longer appears in the DOM.
First, inspect the next-page element: right-click the next-page button and select Inspect to view its markup.
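On this WooCommerce-based demo, the next-page link looks roughly like the simplified snippet below; the next class is what the scraper will target:
<a class="next page-numbers" href="https://www.scrapingcourse.com/ecommerce/page/2/">→</a>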
Define your scraping logic inside a scraper function. This function accepts a response argument and loops through the product containers to extract the target product information.
# define a scraper function
def scraper(response):
    # obtain all the product containers
    products = response.html.find_all(".product")
    # specify an empty data array to collect the extracted products
    product_data = []
    # loop through the product container array to extract specific product data
    for product in products:
        data = {
            "name": product.find(".woocommerce-loop-product__title").text,
            # remove newline characters
            "price": product.find(".price").text.replace("\n", ""),
            "url": product.find("a").href,
            "img": product.find("img").src
        }
        # append the extracted data to the product_data array
        product_data.append(data)
    # return the product data
    return product_data
The next step is to navigate each page and scrape its data. Import Hrequests into your scraper and request the target website to create a response object. Specify an empty data array to collect all the scraped data after navigation.
import hrequests
# ... function for scraping logic
# request the target website
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")
# an array to collect all the products after navigation
all_products = []
Start a while loop that tracks the next page element. Extend the empty array with new data by executing the scraper function. Follow the next page link iteratively if it still exists in the DOM. Otherwise, break the loop and print the output:
# ...
while True:
    # find the next page link element
    next_page = response.html.find(".next")
    # extend the all_products array with new data
    all_products.extend(scraper(response))
    # check if the next page still exists
    if next_page:
        # keep scraping if it exists
        response = hrequests.get(next_page.href)
    else:
        break

# output the scraped products after navigating all 12 pages
print(all_products)
Here's the full code after combining all the snippets:
# import the required library
import hrequests

# define a scraper function
def scraper(response):
    # obtain all the product containers
    products = response.html.find_all(".product")
    # specify an empty data array to collect the extracted products
    product_data = []
    # loop through the product container array to extract specific product data
    for product in products:
        data = {
            "name": product.find(".woocommerce-loop-product__title").text,
            # remove newline characters
            "price": product.find(".price").text.replace("\n", ""),
            "url": product.find("a").href,
            "img": product.find("img").src
        }
        # append the extracted data to the product_data array
        product_data.append(data)
    # return the product data
    return product_data

# request the target website
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")

# an array to collect all the products after navigation
all_products = []

while True:
    # find the next page link element
    next_page = response.html.find(".next")
    # extend the all_products array with new data
    all_products.extend(scraper(response))
    # check if the next page still exists
    if next_page:
        # keep scraping if it exists
        response = hrequests.get(next_page.href)
    else:
        break

# output the scraped products after navigating all 12 pages
print(all_products)
That code extracts the specified data from all the pages on the target website:
[
    {
        'name': 'Abominable Hoodie',
        'price': '$69.00',
        'url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/',
        'img': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
    },
    # ... other products omitted for brevity
    {
        'name': 'Zoltan Gym Tee',
        'price': '$29.00',
        'url': 'https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/',
        'img': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main-324x324.jpg'
    }
]
Congratulations, you've just scraped data from multiple pages with Hrequests! However, you can achieve this task faster with concurrency.
Concurrency and Parallel Requests
Concurrency lets you scrape multiple pages faster. The Hrequests library supports concurrency out of the box: all it takes is passing a list of the page URLs to the Hrequests get method.
The demo website formats its page URL like so:
https://www.scrapingcourse.com/ecommerce/page/<page-number>/
For example, the third page has the following URL:
https://www.scrapingcourse.com/ecommerce/page/3/
To verify, open the website in a browser like Chrome, navigate to the third page, and check the URL in the address bar.
Let's modify the previous code to scrape all the product pages concurrently.
Maintain the previous scraper function and extend the code with a list containing all the page URLs:
# ...
# list all the page URLs
urls = [
    "https://www.scrapingcourse.com/ecommerce/page/1/",
    "https://www.scrapingcourse.com/ecommerce/page/2/",
    "https://www.scrapingcourse.com/ecommerce/page/3/",
    "https://www.scrapingcourse.com/ecommerce/page/4/",
    "https://www.scrapingcourse.com/ecommerce/page/5/",
    "https://www.scrapingcourse.com/ecommerce/page/6/",
    "https://www.scrapingcourse.com/ecommerce/page/7/",
    "https://www.scrapingcourse.com/ecommerce/page/8/",
    "https://www.scrapingcourse.com/ecommerce/page/9/",
    "https://www.scrapingcourse.com/ecommerce/page/10/",
    "https://www.scrapingcourse.com/ecommerce/page/11/",
    "https://www.scrapingcourse.com/ecommerce/page/12/"
]
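Since the URLs differ only by page number, you could also generate this list with a quick comprehension instead of typing out all 12 entries:
# build the 12 page URLs programmatically
urls = [
    f"https://www.scrapingcourse.com/ecommerce/page/{page}/"
    for page in range(1, 13)
]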
Request each URL concurrently with the Hrequests HTTP client to get their response objects. Then, loop through the response objects and execute the scraper function inside the extend method to add each page's data to the product array:
# ...
# request each URL concurrently
responses = hrequests.get(urls)

# an array to collect all the products after navigation
all_products = []

# scrape each response object iteratively
for response in responses:
    # extend the all_products array with new data
    all_products.extend(scraper(response))

# output the scraped products after navigating all 12 pages
print(all_products)
Combine the snippets. Here's the full code:
# import the required libraries
import hrequests

# define a scraper function
def scraper(response):
    # obtain all the product containers
    products = response.html.find_all(".product")
    # specify an empty data array to collect the extracted products
    product_data = []
    # loop through the product container array to extract specific product data
    for product in products:
        data = {
            "name": product.find(".woocommerce-loop-product__title").text,
            # remove newline characters
            "price": product.find(".price").text.replace("\n", ""),
            "url": product.find("a").href,
            "img": product.find("img").src
        }
        # append the extracted data to the product_data array
        product_data.append(data)
    # return the product data
    return product_data

# list all the page URLs
urls = [
    "https://www.scrapingcourse.com/ecommerce/page/1/",
    "https://www.scrapingcourse.com/ecommerce/page/2/",
    "https://www.scrapingcourse.com/ecommerce/page/3/",
    "https://www.scrapingcourse.com/ecommerce/page/4/",
    "https://www.scrapingcourse.com/ecommerce/page/5/",
    "https://www.scrapingcourse.com/ecommerce/page/6/",
    "https://www.scrapingcourse.com/ecommerce/page/7/",
    "https://www.scrapingcourse.com/ecommerce/page/8/",
    "https://www.scrapingcourse.com/ecommerce/page/9/",
    "https://www.scrapingcourse.com/ecommerce/page/10/",
    "https://www.scrapingcourse.com/ecommerce/page/11/",
    "https://www.scrapingcourse.com/ecommerce/page/12/"
]

# request each URL concurrently
responses = hrequests.get(urls)

# an array to collect all the products after navigation
all_products = []

# scrape each response object iteratively
for response in responses:
    # extend the all_products array with new data
    all_products.extend(scraper(response))

# output the scraped products after navigating all 12 pages
print(all_products)
The code scrapes all the listed product pages as shown:
[
    {
        'name': 'Abominable Hoodie',
        'price': '$69.00',
        'url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/',
        'img': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
    },
    # ... other products omitted for brevity
    {
        'name': 'Zoltan Gym Tee',
        'price': '$29.00',
        'url': 'https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/',
        'img': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main-324x324.jpg'
    }
]
That's it! Your Hrequests scraper now extracts content from multiple pages using concurrency.
However, what if you're dealing with a dynamic website?
Scrape JavaScript-Rendered Pages
As mentioned, Hrequests has a headless browser feature that lets you interact with web pages and extract JavaScript-rendered content. This feature is essential for executing user actions like scrolling, clicking, and more.
Let's see how it works by extracting product names and prices from the ScrapingCourse infinite scrolling challenge page. This website uses JavaScript to load content as you scroll down.
The logic is to scroll through the page and extract content continuously until there is no more content.
Before you begin, inspect the target website. Open the website via a browser, right-click the first element, and select Inspect.
You'll see that the products are inside individual div tags.
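Each product card's markup looks roughly like the simplified snippet below; the class names are what the selectors in the next step target, while the exact tags may differ slightly:
<div class="product-item ...">
    <span class="product-name">Chaz Kangeroo Hoodie</span>
    <span class="product-price">$52</span>
</div>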
Define your scraping logic inside a scraper function. This function accepts a page argument. It then extracts all the product containers with CSS selectors and loops through them to extract their names and prices:
# define a scraper function
def scraper(page):
    # extract all the product containers
    products = page.html.find_all(".product-item")
    # specify an empty data array to collect the extracted products
    product_data = []
    # loop through each container to extract names and prices
    for product in products:
        # get all product texts into a dictionary
        data = {
            "name": product.find(".product-name").text,
            "price": product.find(".product-price").text
        }
        # append each product data to the product data array
        product_data.append(data)
    # print the output data
    print(product_data)
Spin up a browser instance and visit the target website:
# ... scraping logic function
# spin a browser instance
session = Session(browser="chrome")
# create a page session and visit the target website
page = session.render("https://www.scrapingcourse.com/infinite-scrolling")
Let's start scrolling. Import the Hrequests Session class and the Python built-in time module into your scraper file. You'll use the time module to pause for more content to load before scrolling further.
Get the current page height. Then, start a while loop that scrolls the page continuously and updates the previous height to the new one. Break the loop once the height stops increasing, which means there's no more content to load. Finally, execute the scraper function and close the browser:
# import the required libraries
from hrequests import Session
import time

# ... scraping logic function

# get the previous height value
last_height = page.evaluate("document.body.scrollHeight")

while True:
    # scroll down the page
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    # pause for more elements to load
    time.sleep(10)
    # get the new height value
    new_height = page.evaluate("document.body.scrollHeight")
    # check if there are more heights to scroll
    if new_height == last_height:
        break
    # update the previous height to the new one
    last_height = new_height

# execute the scraper function
scraper(page)

# close the browser instance
page.close()
Combine the snippets, and you'll get the following complete code:
# import the required libraries
from hrequests import Session
import time

# define a scraper function
def scraper(page):
    # extract all the product containers
    products = page.html.find_all(".product-item")
    # specify an empty data array to collect the extracted products
    product_data = []
    # loop through each container to extract names and prices
    for product in products:
        # get all product texts into a dictionary
        data = {
            "name": product.find(".product-name").text,
            "price": product.find(".product-price").text
        }
        # append each product data to the product data array
        product_data.append(data)
    # print the output data
    print(product_data)

# spin a browser instance
session = Session(browser="chrome")

# create a page session and visit the target website
page = session.render("https://www.scrapingcourse.com/infinite-scrolling")

# get the previous height value
last_height = page.evaluate("document.body.scrollHeight")

while True:
    # scroll down the page
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    # pause for more elements to load
    time.sleep(10)
    # get the new height value
    new_height = page.evaluate("document.body.scrollHeight")
    # check if there are more heights to scroll
    if new_height == last_height:
        break
    # update the previous height to the new one
    last_height = new_height

# execute the scraper function
scraper(page)

# close the browser instance
page.close()
That code scrolls the target website's full height and scrapes all its data. See the result below:
[
    {'name': 'Chaz Kangeroo Hoodie', 'price': '$52'},
    {'name': 'Teton Pullover Hoodie', 'price': '$70'},
    # ... other products omitted for brevity
    {'name': 'Antonia Racer Tank', 'price': '$34'},
    {'name': 'Breathe-Easy Tank', 'price': '$34'}
]
Bravo! You've just used Hrequests to extract data from a web page that loads content with infinite scrolling.
You've explored some of Hrequests' most valuable features. However, despite the library's potential for building a good web scraper, it still has significant limitations.
Limitations of Hrequests for Web Scraping
Despite packing many valuable features for web scraping and crawling, Hrequests has a few shortcomings that can hinder your scraper, especially for large-scale tasks.
The first limitation is its low user base and small community, which makes solving related problems quite challenging for beginners. Another drawback is that it doesn't support proxy authentication, a requirement for implementing most premium proxy services.
Although the Hrequests documentation suggests its Firefox browser can bypass Cloudflare, our anti-bot tests show that it fails against Cloudflare, Akamai, and DataDome.
For example, the current Hrequests scraper can't access a Cloudflare-protected website like the G2 Reviews page. Try it yourself with the following code:
# import the required library
from hrequests import Session
# create a Firefox browser instance
session = Session(browser="firefox")
# open the target website
page = session.render("https://www.g2.com/products/asana/reviews")
print(page.content)
The above Hrequests scraper gets blocked by Cloudflare, as shown:
<!DOCTYPE html>
<html class="lang-en-us" lang="en-US">
<head>
    <title>Just a moment...</title>
</head>
<!-- ... -->
<body>
    <!-- ... -->
    <div class="text-center" id="footer-text">
        Performance & security by Cloudflare
    </div>
</body>
</html>
Fortunately, there's a solution to all these limitations, as you'll see in the next section.
Avoid Getting Blocked While Scraping With Hrequests
You can avoid all the limitations of Hrequests and scrape any protected website with a web scraping API like ZenRows. It provides a simple API that auto-rotates premium proxies, helping you overcome Hrequests' proxy challenges.
ZenRows also bypasses CAPTCHAs and other anti-bot systems, regardless of their complexity, and can act as a headless browser to automate user actions and easily scrape dynamic websites.
Let's use ZenRows to access the G2 Reviews page that blocked you previously to see how it works.
Sign up to open the ZenRows Request builder. Paste the target URL in the link box, activate Premium Proxies, and select JS Rendering. Choose Python as your preferred language and select the API connection mode. Copy and paste the generated code into your Python script.
The generated code should look like this:
# pip install requests
import requests

url = "https://www.g2.com/products/asana/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
The code extracts the protected website's full-page HTML. See the output below:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You've just bypassed an advanced anti-bot protection using ZenRows.
Conclusion
You've seen how Hrequests works and how to use its basic and advanced features to scrape websites in Python. You've learned how to:
- Make a simple request with Hrequests to obtain full-page HTML.
- Scrape specific elements with Hrequests.
- Export scraped data to a CSV file.
- Scrape data from multiple pages of a paginated website using the next page link.
- Apply Hrequests' concurrency to extract content simultaneously from many pages.
- Use Hrequests to scrape data from a dynamic website that uses infinite scrolling.
Still, remember that the Hrequests library has limitations that can get you blocked. We recommend using ZenRows, an all-in-one web scraping solution, to bypass all anti-bot mechanisms and scrape any website without getting blocked.
Start your ZenRows free trial now without a credit card!