Extracting data from tables is one of the more challenging parts of web scraping. It becomes even harder when you're dealing with complex markup, dynamic website designs, or obfuscated HTML.
Having easy-to-understand steps at your disposal is crucial.
In this guide, we'll walk you through the steps to parse tables using BeautifulSoup and then explore two even better ways to make the task easier.
How to Parse Tables With BeautifulSoup
BeautifulSoup is a powerful Python parsing library, and here's why.
It creates a DOM tree that lets you quickly navigate and manipulate HTML documents. It also provides a high-level API that sits atop three different parsers (lxml, html5lib, and html.parser), allowing you to choose according to your project needs.
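For illustration, here's a minimal sketch of how the parser choice is made when you construct the object (lxml and html5lib are installed separately; html.parser ships with Python):
# a minimal sketch of picking a parser when building the BeautifulSoup object
from bs4 import BeautifulSoup
html = "<table><tr><td>cell</td></tr></table>"
soup_fast = BeautifulSoup(html, "lxml")  # fastest, lenient with broken HTML
soup_browser = BeautifulSoup(html, "html5lib")  # parses the way a browser does
soup_builtin = BeautifulSoup(html, "html.parser")  # no extra dependency
print(soup_fast.td.text)  # cell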
To parse tables using BeautifulSoup, you need to locate the table element, then iterate through each row and extract the data from the cells.
Below is a step-by-step guide.
Prerequisite
Before you start writing the code, you must install the required libraries.
First, we'll install the Python Requests package, which allows you to fetch the webpage. You can use any HTTP client of your choice.
Once you retrieve the page, you need to extract data from the target table. This is where BeautifulSoup and lxml come in.
We'll use the lxml parser because it's much faster than the others.
Navigate to your terminal and enter the following commands to install Requests, BeautifulSoup, and lxml.
pip3 install requests beautifulsoup4 lxml
You're all set.
Step 1: Fetch Target Webpage
Let's take a look at the table we'll be scraping. We'll use the ScrapingCourse's Table Parsing Challenge as the target website for this demonstration.
To fetch this page, import the necessary libraries. Then, using Requests, make a GET request to the target server and retrieve the response.
# import the necessary libraries
from bs4 import BeautifulSoup
import requests
# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text
print(html)
Step 2: Parse HTML Tables
Now that you have the raw HTML file, it's time to parse tables using BeautifulSoup.
Start by creating a BeautifulSoup object.
# ...
# create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
This converts the HTML document into a parse tree, which allows you to interact with elements in a parent-child structure.
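For example, once the parse tree exists, you can move between parents and children directly. Here's a minimal sketch (the exact parent tag depends on the page's markup):
# ...
# navigate the parse tree in a parent-child fashion
first_row = soup.find('tr')  # first table row in the document
print(first_row.parent.name)  # its parent element, e.g. "thead" or "table"
print([cell.name for cell in first_row.find_all(['th', 'td'])])  # the cell elements it contains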
After that, locate the table element.
# ...
# select the table element
table = soup.find('table')
BeautifulSoup provides easy-to-use methods like find() and find_all(), which let you select HTML elements by different criteria such as tag names and attributes, plus select() and select_one() for CSS selectors.
In this case (a demo website with only one table), we use the find() method to select the target table by its <table> tag.
However, you should inspect the page via the DevTools to identify the right selectors for more complex cases.
While using the selectors provided by the DevTools may work for testing purposes, they're sometimes less meaningful and can easily break. Thus, it's best to identify your selectors manually.
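For instance, if a page contained several tables, you could pin the selection to an attribute instead of the bare tag. The class and id values below are hypothetical and only illustrate the idea:
# hypothetical selectors; the class and id values are examples, not attributes of the demo page
table_by_class = soup.find('table', class_='product-table')
table_by_id = soup.find('table', id='product-catalog')
table_by_css = soup.select_one('table#product-catalog')  # the same selection via a CSS selector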
Lastly, select the table rows and iterate through each one to extract their text content.
# ...
# extract headers
headers = [th.text.strip() for th in table.find_all('th')]
# extract table body
rows = []
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    rows.append(cells)
# log data
print("Headers:", headers)
print("table_body:")
for row in rows:
    print(row)
For simplicity, we've separated the headers from the table body.
The headers are the <th> elements nested in the first row of the table (<tr>). We select all <th> elements and loop through them, extracting their text content.
Similarly, we select all the remaining <tr> elements (skipping the header row), grab the <td> cells in each one, and collect their text content into the rows list.
That's it.
Now, put all the steps together to get the complete code.
# import the necessary libraries
from bs4 import BeautifulSoup
import requests
# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text
# create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# select the table element
table = soup.find('table')
# extract headers
headers = [th.text.strip() for th in table.find_all('th')]
# extract table body
rows = []
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    rows.append(cells)
# log data
print("Headers:", headers)
print("table_body:")
for row in rows:
    print(row)
This code prints the headers and data in subsequent rows.
Headers: ['Product ID', 'Name', 'Category', 'Price', 'In Stock']
table_body:
['001', 'Laptop', 'Electronics', '$999.99', 'Yes']
['002', 'Smartphone', 'Electronics', '$599.99', 'Yes']
['003', 'Headphones', 'Audio', '$149.99', 'No']
['004', 'Coffee Maker', 'Appliances', '$79.99', 'Yes']
# ... truncated for brevity
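If you want to persist the extracted data, you can write it to a CSV file with Python's built-in csv module. Here's a minimal sketch that assumes the headers and rows variables from the complete script above:
import csv
# write the parsed table to a CSV file
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(headers)  # header row
    writer.writerows(rows)  # one line per table row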
Congratulations! You now know how to parse tables using BeautifulSoup.
However, while the example above is pretty straightforward, parsing with BeautifulSoup can get complex, depending on your use case.
Modern websites often use dynamic designs that lead to frequent layout and class name changes, not to mention the lack of "meaningful" selectors on real-world pages and the obfuscation and WAFs employed to deter web scraping.
To handle such scenarios, let's explore two more efficient alternatives.
Pro Alternative #1: Table Parsing Using Pandas
Pandas is an open-source data analysis and manipulation tool. It provides a read_html() function that parses tables directly into DataFrames.
DataFrames are similar to table structures and are more intuitive than manually extracting text content when using BeautifulSoup.
To parse tables using Pandas, load the HTML document into a list of DataFrames using the read_html() function.
Here's a step-by-step guide.
Prerequisite
Install the Pandas library using the following command.
pip3 install pandas
Parse HTML Tables
Import the necessary libraries and parse the HTML document using the read_html() function.
Note that passing literal HTML into read_html() is deprecated, so you need to wrap it in a StringIO object and add an import for StringIO.
# import the required libraries
import pandas as pd
from io import StringIO
import requests
# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text
# parse HTML
dataframes = pd.read_html(StringIO(html))
print(dataframes)
This code will return a list of DataFrames, which you can access by indexing. Each DataFrame corresponds to the tables on the target webpage. Since we have only one table in this example, this list contains only one DataFrame.
[ Product ID Name Category Price In Stock
0 1 Laptop Electronics $999.99 Yes
1 2 Smartphone Electronics $599.99 Yes
2 3 Headphones Audio $149.99 No
3 4 Coffee Maker Appliances $79.99 Yes
4 5 Running Shoes Sports $89.99 Yes
5 6 Smart Watch Electronics $249.99 Yes
...
]
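From here, grab the DataFrame from the list by index and work with it directly. Here's a minimal sketch that filters and exports the data (the column name is taken from the output above):
# access the first (and only) table on the page
df = dataframes[0]
# example: keep only the in-stock products
in_stock = df[df["In Stock"] == "Yes"]
# export the filtered table to CSV without the index column
in_stock.to_csv("products_in_stock.csv", index=False)
print(in_stock.head())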
Well done!
Pro Alternative #2: Parsing Tables With ZenRows
Unlike BeautifulSoup, ZenRows is much more than a parsing tool. It is a web scraping API that simplifies data extraction in complex scenarios. ZenRows automatically handles class name changes, obfuscation, and most potential challenges, allowing you to focus on extracting your desired data.
It returns the table data in JSON format and sections the output into dimensions, headings, and content. This can be useful for many reasons, including quick processing and transmitting data via APIs.
ZenRows requires less code compared to open-source tools like BeautifulSoup. You only need a single line of code, the outputs parameter, to parse tables with ZenRows.
Here's a step-by-step guide using the same target webpage as before.
Prerequisite
To use ZenRows, you need an API key. Sign up for free to get yours.
You'll be redirected to the Request Builder page, where you'll find your ZenRows API key at the top right.
Parse HTML Tables
Input the target URL (https://www.scrapingcourse.com/table-parsing). Then, activate Premium Proxies and JS Rendering to handle advanced anti-bot systems.
That'll generate your request code on the right. Copy it to your code editor.
After that, set the outputs parameter to tables, and ZenRows will automatically parse the tables on the page, returning dimensions, headings, and content.
Your complete code should look like this:
# import required library
import requests
url = 'https://www.scrapingcourse.com/table-parsing'
apikey = '<YOUR_ZENROWS_API_KEY>'
# set the necessary parameters
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
    'outputs': 'tables'
}
# make GET request to target page and retrieve response
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Here's the result:
{
    "Dimensions": {
        "rows": 15,
        "columns": 5,
        "heading": true
    },
    "Headings": ["Product ID","Name","Category","Price","In Stock"],
    "content": [
        {"Category":"Electronics","In Stock":"Yes","Name":"Laptop","Price":"$999.99","Product ID":"001"},
        {"Category":"Electronics","In Stock":"Yes","Name":"Smartphone","Price":"$599.99","Product ID":"002"},
        {"Category":"Audio","In Stock":"No","Name":"Headphones","Price":"$149.99","Product ID":"003"},
        ...
    ]
}
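Since the response is already structured JSON, you can feed it straight into the rest of your pipeline. Here's a minimal sketch that parses the response and loads the content array into a Pandas DataFrame, assuming the structure shown above:
import pandas as pd
# parse the JSON response (structure as in the sample output above)
data = response.json()
# load the row objects into a DataFrame for further processing
df = pd.DataFrame(data["content"])
print(df.head())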
Awesome, right? That's how easy it is to parse tables using ZenRows.
To try ZenRows for free, sign up now.
Conclusion
Parsing tables can be one of the toughest web scraping challenges, especially when dealing with complex web layouts and advanced anti-bot systems.
While open-source tools like BeautifulSoup and Pandas have their use cases, the ZenRows web scraping API is the best approach for all cases. It's resilient to layout changes and can bypass any anti-bot system.
This makes ZenRows an excellent tool for parsing data in real-world scenarios where complex HTML and anti-bot systems dominate.
Try ZenRows for free now.