5 Best Python Web Scraping Libraries in 2023

January 2, 2023 · 8 min read

Struggling to find the best Python web scraping library to use? You aren't alone. It's frustrating to settle on a scraping library only to have it fail, whether because it's slow or because it keeps getting detected by anti-bots.

A good Python library for web scraping should be fast, scalable and capable of crawling any type of web page. In this article, we'll discuss the 5 best libraries for web scraping in Python, their pros and cons, and provide a quick example of each so you can see how they work.

What are the best Python web scraping libraries?

We ran some background tests to verify which Python web scraping libraries are capable of scraping a web page without problems. The best ones are ZenRows, Selenium, Requests, Scrapy and urllib3. Here are some features worth mentioning:

| Library | Ease of use | Performance | Dynamic data |
| --- | --- | --- | --- |
| ZenRows | Easy to use. | Fast for static content, moderate for dynamic content. Consumes fewer resources than the other libraries. | Yes |
| Selenium | Quite difficult to use compared to libraries like Requests and urllib3. | Slow, with high resource consumption. | Yes |
| Requests | One of the easiest web scraping libraries to use, but it has fewer capabilities. | Fast, with low resource consumption. | - |
| Scrapy | Difficult to learn compared to the other Python web scraping libraries. | Fast, with medium resource consumption. | - |
| urllib3 | Similar to Requests but with a lower-level API. | Fast, with low resource consumption. | - |

Let's go into detail and discuss these libraries with some Python web scraping examples. We'll extract the product details from the Vue Storefront demo site with each of them.

[Screenshot: the Vue Storefront demo page]

1. ZenRows

ZenRows API is a Python web scraping library capable of bypassing some of the biggest scraping obstacles, like anti-bots and CAPTCHAs. Its features include rotating and premium proxies, a headless browser, geo-targeting, anti-bot bypassing and more.
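
As a quick sketch of what those features look like in code, the snippet below enables premium proxies and geo-targeting through request parameters. The parameter names (premium_proxy, proxy_country) follow the ZenRows documentation, so double-check them against your dashboard:

from zenrows import ZenRowsClient 
 
client = ZenRowsClient("YOUR_API_KEY") 
 
# Parameter names as documented by ZenRows: 
# premium_proxy routes the request through the premium proxy pool, 
# proxy_country picks the country of the exit IP. 
params = { 
	"premium_proxy": "true", 
	"proxy_country": "us", 
} 
 
response = client.get("https://demo.vuestorefront.io/", params=params) 
print(response.status_code)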

๐Ÿ‘ Pros:
  • ZenRows is easy to use.
  • It can easily bypass CAPTCHAs and antibots.
  • Smart rotational proxies.
  • It can scrape JavaScript-rendered pages.
  • It also works with other libraries.
👎 Cons:
  • It's a paid service but it comes with a free trial.

How to scrape a web page with ZenRows

Step 1: Generate the Python code

To get started, create a free ZenRows account and navigate to the dashboard. From the dashboard, select Python and enter the target website's URL.

[Screenshot: the ZenRows dashboard]

Since our target web page is dynamically generated, activate the JavaScript rendering option and, from the options shown, select JavaScript instructions. For this example, you need to include the "fill" key: a list with the ID selector of the search box ("#search") and the word "laundry" to type into it.

[Screenshot: the ZenRows JavaScript instructions]

The "wait_for" key makes the script wait for a specific item to appear, in this case the items with a class of "sf-product-card__title". The "wait" parameter is optional, indicating how many milliseconds to wait before retrieving the information.

Step 2: Parse the response

ZenRows has limited support for parsing the generated HTML, so we'll use BeautifulSoup, which provides methods like find and find_all to get elements with specific IDs or classes from the HTML tree.
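
To see the difference between the two methods, here's a tiny standalone example:

from bs4 import BeautifulSoup 
 
html = "<div><span class='title'>A</span><span class='title'>B</span></div>" 
soup = BeautifulSoup(html, "html.parser") 
 
print(soup.find("span", {"class": "title"}).text)                   # A (first match only) 
print([s.text for s in soup.find_all("span", {"class": "title"})])  # ['A', 'B'] (all matches)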

Go ahead and import the library, then create a new BeautifulSoup object by passing it the data extracted from the URL, plus a second parameter, the parser, which can be 'html.parser', 'lxml' or 'xml'. Make a new file called zenrowsTest.py and paste the code:

from zenrows import ZenRowsClient 
from bs4 import BeautifulSoup 
import json 
 
client = ZenRowsClient("YOUR_API_KEY")  # your API key from the dashboard 
url = "https://demo.vuestorefront.io/" 
 
# Browser instructions, executed in order: 
js_instructions = [ 
	{"wait": 500},                            # wait 500 ms for the page to settle 
	{"fill": ["#search", "laundry"]},         # type "laundry" into the search box 
	{"wait_for": ".sf-product-card__title"},  # wait until the result cards appear 
] 
 
params = { 
	"js_render": "true",  # enable JavaScript rendering 
	"js_instructions": json.dumps(js_instructions), 
} 
 
response = client.get(url, params=params) 
soup = BeautifulSoup(response.text, "html.parser") 
 
# Print the title of every product card in the results 
for item in soup.find_all("span", {"class": "sf-product-card__title"}): 
	print(item.text)

Congratulations! You have successfully scraped a web page using ZenRows. Here's what the output looks like:

[Sample] Canvas Laundry Cart 
[Sample] Laundry Detergent

2. Selenium

Selenium is a widely used Python scraping library capable of scraping dynamic web content. With it, you can simulate actions performed on a website, like clicking a button, filling out forms and more.

๐Ÿ‘ Pros:
  • It can scrape dynamic web pages.
👎 Cons:
  • Selenium can be slow.
  • It can't get status codes.

How to scrape a web page with Selenium

Step 1: Find the input tag

To scrape a web page using Selenium, you can make use of a WebDriver and then locate the input tag element (the search box) by using the find_element method. After finding the correct input element, you have to write the desired query and hit Enter.

Step 2: Retrieve the span tags

Once you've submitted the search, you can look for the span tags of the returned items. Since the server may take a while to return the results, use WebDriverWait to wait until they show up.

Once the items are available, get them by giving their class name as a parameter to the find_elements method. Here's everything we've just mentioned:

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.chrome.service import Service as ChromeService 
from webdriver_manager.chrome import ChromeDriverManager 
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.support.ui import WebDriverWait 
 
url = "https://demo.vuestorefront.io/" 
 
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install())) as driver: 
	driver.get(url) 
 
	# Type the query into the search box and submit it with Enter 
	search_box = driver.find_element(By.CSS_SELECTOR, "input[type='search']") 
	search_box.send_keys("laundry" + Keys.ENTER) 
 
	# Wait up to 3 seconds for at least one result card to render 
	WebDriverWait(driver, timeout=3).until( 
		lambda d: d.find_element(By.CLASS_NAME, "sf-product-card__title")) 
 
	# Collect and print all the product titles 
	items = driver.find_elements(By.CLASS_NAME, "sf-product-card__title") 
	for item in items: 
		print(item.text)

After running the code, you should see the names of the two items printed on the console:

[Sample] Canvas Laundry Cart 
[Sample] Laundry Detergent

And there you have it!

3. Requests

Requests is a user-friendly Python web scraping library built on top of urllib3. It can fetch a URL directly, without a PoolManager instance, and once you've made a GET request, you can access the contents of the web page through the content property of the response object.

๐Ÿ‘ Pros:
  • It doesn't require PoolManager.
  • It's fast.
👎 Cons:
  • It can't scrape interactive or dynamic sites with JavaScript.

How to scrape a web page using Requests

Let's work with a Vue Storefront page that lists kitchen products. There are five items listed on the page, and each one has its title in a span tag with the class sf-product-card__title.

[Screenshot: the Vue Storefront kitchen category]

Step 1: Get the main contents with the GET method

Use this code:

import requests 
r = requests.get('https://demo.vuestorefront.io/c/kitchen')

The GET method returns a response object, saved here in the variable r. From it, you can obtain the status code through the status_code property (in this case, code 200) and the HTML data through the content property.
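
For illustration, a quick check of both properties looks like this:

import requests 
 
r = requests.get('https://demo.vuestorefront.io/c/kitchen') 
print(r.status_code)    # 200 if the request succeeded 
print(r.content[:100])  # the first 100 bytes of the raw HTML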

Step 2: Extract the specific information with BeautifulSoup

Extract the span tags with the class of sf-product-card__title by using the find_all method on the BeautifulSoup object:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(r.content, 'html.parser') 
 
for item in soup.find_all('span', {'class': 'sf-product-card__title'}): 
	print(item.text)

This returns a list of all the span tags with that class found in the document and, with a simple for loop, you can print the desired information on screen. Let's make a new file called requestsTest.py and write the following code:

import requests 
from bs4 import BeautifulSoup 
 
# Fetch the kitchen category page 
r = requests.get('https://demo.vuestorefront.io/c/kitchen') 
soup = BeautifulSoup(r.content, 'html.parser') 
 
# Print every product title on the page 
for item in soup.find_all('span', {'class': 'sf-product-card__title'}): 
	print(item.text)

Congratulations! You made it: you've successfully used the Requests Python library for web scraping. Your output should look like this:

[Sample] Tiered Wire Basket 
[Sample] Oak Cheese Grater 
[Sample] 1 L Le Parfait Jar 
[Sample] Chemex Coffeemaker 3 Cup 
[Sample] Able Brewing System

4. Scrapy

Scrapy is a high-level framework that can be used to scrape data from highly complex websites. With Scrapy, it's possible to bypass CAPTCHAs using predefined functions or external libraries. You write a simple Scrapy crawler by defining it as a Python class, but it's not very user-friendly compared to other Python scraping libraries.

Although the learning curve for this library is steep, you can do a lot with it and it's highly efficient in performing crawling tasks.

๐Ÿ‘ Pros:
  • General framework for scraping purposes.
  • It doesn't require BeautifulSoup.
👎 Cons:
  • Steep learning curve.
  • Scrapy can't scrape dynamic web pages on its own.

How to scrape a web page using Scrapy

Step 1: Create a Spider class

Make a new class named kitchenSpider that inherits from scrapy.Spider. Inside the class, define name as mySpider and start_urls as a list of the URLs to scrape.

import scrapy 
 
class kitchenSpider(scrapy.Spider): 
	name='mySpider' 
	start_urls = ['https://demo.vuestorefront.io/c/kitchen',]

Step 2: Define the parse method

The parse method takes a response parameter, and you can retrieve each item with the css method of the response object. The css method takes the item's class selector as its parameter:

response.css('.sf-product-card__title')

To retrieve all the items with that class, make a for loop and print the contents with the xpath method:

for item in response.css('.sf-product-card__title'): 
	print(item.xpath('string(.)').get())
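
As an aside, Scrapy selectors also support the CSS ::text pseudo-element; for simple tags like these spans, it should print the same titles as the xpath call above:

for item in response.css('.sf-product-card__title'): 
	print(item.css('::text').get())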

Make a new file called scrapyTest.py using the code below:

import scrapy 
 
class kitchenSpider(scrapy.Spider): 
	name = 'mySpider' 
	start_urls = ['https://demo.vuestorefront.io/c/kitchen',] 
 
	def parse(self, response): 
		# Select every product title node and print its full text content 
		for item in response.css('.sf-product-card__title'): 
			print(item.xpath('string(.)').get())

Run the spider by executing the following script in the terminal and you should see the list of items printed on screen:

scrapy runspider scrapyTest.py 
 
[Sample] Tiered Wire Basket 
[Sample] Oak Cheese Grater 
[Sample] 1 L Le Parfait Jar 
[Sample] Chemex Coffeemaker 3 Cup 
[Sample] Able Brewing System

That's it!

5. urllib3

urllib3 is a low-level HTTP library that other Python web scraping libraries, like Requests, are built on. It works through a PoolManager instance, a class that manages connection pooling and thread safety for you.
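
As a minimal sketch of why that pooling matters, a single PoolManager can serve several requests, letting urllib3 reuse the underlying connection instead of reopening it each time:

import urllib3 
 
http = urllib3.PoolManager() 
 
# Both requests go through the same pool, so the connection can be reused 
for _ in range(2): 
	r = http.request('GET', 'https://demo.vuestorefront.io/c/kitchen') 
	print(r.status, len(r.data))  # HTTP status code and body size in bytes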

๐Ÿ‘ Pros:
  • Handles concurrency with PoolManager.
👎 Cons:
  • Complicated syntax compared to other libraries like Requests.
  • urllib3 can't extract dynamic data.

How to scrape a web page using urllib3

Step 1: Create a PoolManager instance

Import the urllib3 library, then create a PoolManager instance and save it to a variable called http:

import urllib3 
http = urllib3.PoolManager()

Once a PoolManager instance is created, you can make an HTTP GET request by calling the request() method on it.

Step 2: Make a GET request

Use the request method on the PoolManager instance. It takes two parameters for a simple GET request: the first is the string 'GET' and the second is the URL you're trying to scrape:

r = http.request('GET', 'https://demo.vuestorefront.io/c/kitchen')

Step 3: Extract the data from the response object

The request returns an HTTPResponse object, from which you can obtain information such as the status code and the body. Let's get the body from the data attribute of the response object and parse it with BeautifulSoup:

soup = BeautifulSoup(r.data, 'html.parser')

To extract the data, use a for loop with the find_all method and the name of the item's class:

for item in soup.find_all('span', {'class': 'sf-product-card__title'}): 
	print(item.text)

Create a new file called urllib3Test.py with the following code:

import urllib3 
from bs4 import BeautifulSoup 
 
# A single PoolManager handles all requests and reuses connections 
http = urllib3.PoolManager() 
 
r = http.request('GET', 'https://demo.vuestorefront.io/c/kitchen') 
soup = BeautifulSoup(r.data, 'html.parser') 
 
# Print every product title on the page 
for item in soup.find_all('span', {'class': 'sf-product-card__title'}): 
	print(item.text)

And that's it! You have successfully scraped the data from the kitchen category on the Vue Storefront using the urllib3 Python web scraping library.

[Sample] Tiered Wire Basket 
[Sample] Oak Cheese Grater 
[Sample] 1 L Le Parfait Jar 
[Sample] Chemex Coffeemaker 3 Cup 
[Sample] Able Brewing System

Conclusion

There are different Python web scraping libraries that simplify the scraping process. In this article, we've covered the 5 best ones:
  1. ZenRows.
  2. Selenium.
  3. Requests.
  4. Scrapy.
  5. urllib3.

A common problem with web scraping libraries for Python is their inability to avoid bot detection while scraping a web page, and this makes scraping difficult and stressful. ZenRows solves this problem with a single API call. Take advantage of the current free trial and get 1,000 API credits for free.

Frequently Asked Questions

Why are Python libraries for web scraping important?

Python is one of the most popular languages for building web scrapers, since its classes and objects are significantly easier to work with than those of most other languages. However, building a custom crawler from scratch in Python is difficult, especially when you have to scrape many custom websites with anti-bot measures in place. Python web scraping libraries cut down that lengthy process and make it easy to scrape a web page.

Which libraries are used for web scraping in Python?

There are many Python web scraping libraries to choose from. The most popular ones are ZenRows, Selenium, Requests, Scrapy and urllib3.

What is the best Python web scraping library?

The best Python web scraping library to use is ZenRows. Of course, other libraries can get the job done, but the time and effort spent on learning these tools and the possibility of getting your scraper blocked are headaches that can be avoided easily with ZenRows.

The Requests library is one of the most used web scraping libraries since it helps make basic requests for further analysis.


