Do you want to extract data from tables while scraping with Python?
Table parsing is one of the challenging aspects of web scraping. But we've got you covered! This article shows you the top 3 tools for parsing tables and teaches you how to extract data from HTML tables in Python, including the best overall solution to overcome the challenges of table parsing.
Let's go!
Why Is It Difficult to Parse Tables in Python?
Three main factors make table parsing challenging. Let's understand them below.
Complex Table Layouts
HTML tables on real websites often have complex layouts, making them difficult for beginners to scrape. These tables may include mixed data types, nested elements, merged cells, and other intricate structures that make table parsing difficult during scraping.
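To make the merged-cell problem concrete, here is a minimal sketch of one way to flatten `rowspan` and `colspan` cells into a rectangular grid with BeautifulSoup. The sample markup and the `expand_table` helper are hypothetical illustrations, not part of any library, and the sketch assumes the table contains no nested tables.

```python
from bs4 import BeautifulSoup

# hypothetical sample table with one rowspan cell and one colspan cell
HTML = """
<table>
  <tr><th>Region</th><th>Quarter</th><th>Sales</th></tr>
  <tr><td rowspan="2">North</td><td>Q1</td><td>10</td></tr>
  <tr><td>Q2</td><td>12</td></tr>
  <tr><td colspan="2">Total</td><td>22</td></tr>
</table>
"""


def expand_table(table):
    """Return a list of rows, repeating the text of merged cells."""
    grid = []
    pending = {}  # column index -> (rows still to fill, cell text)
    for tr in table.find_all("tr"):
        row = []
        cells = iter(tr.find_all(["td", "th"]))
        col = 0
        while True:
            # a rowspan cell from an earlier row occupies this column
            if col in pending:
                remaining, text = pending[col]
                row.append(text)
                if remaining > 1:
                    pending[col] = (remaining - 1, text)
                else:
                    del pending[col]
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text = cell.get_text(strip=True)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            # repeat the text across every column the cell spans
            for _ in range(colspan):
                row.append(text)
                if rowspan > 1:
                    pending[col] = (rowspan - 1, text)
                col += 1
        grid.append(row)
    return grid


soup = BeautifulSoup(HTML, "html.parser")
grid = expand_table(soup.find("table"))
for row in grid:
    print(row)
```

Repeating the merged value in every spanned position is only one possible convention. Depending on the data, you may prefer to leave the extra positions empty instead.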
Inconsistent HTML Structure
Some websites may have poorly defined HTML structures, leading to distorted table layouts. This lack of a well-defined table structure can affect the output during web scraping, as the HTML parser might struggle to locate and interpret the expected elements, potentially leading to incomplete or incorrect data extraction.
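As a rough illustration, the sketch below parses a table defensively: it doesn't assume a `tbody` element exists and returns `None` for cells that are missing. The sample markup and the helper names `cell_text` and `parse_products` are hypothetical.

```python
from bs4 import BeautifulSoup

# hypothetical markup: no <tbody>, and the second row lacks a price cell
HTML = """
<table id="product-catalog">
  <tr><th>Product ID</th><th>Name</th><th>Price</th></tr>
  <tr><td class="product-id">001</td><td class="product-name">Laptop</td><td class="product-price">$999.99</td></tr>
  <tr><td class="product-id">002</td><td class="product-name">Smartphone</td></tr>
</table>
"""


def cell_text(row, class_name, default=None):
    """Return the stripped text of a cell, or a default if the cell is missing."""
    cell = row.find("td", class_=class_name)
    return cell.get_text(strip=True) if cell is not None else default


def parse_products(html):
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="product-catalog")
    if table is None:
        return []
    # fall back to the table itself when the page omits <tbody>
    body = table.find("tbody")
    if body is None:
        body = table
    products = []
    for row in body.find_all("tr"):
        if row.find("td") is None:
            continue  # skip rows that only contain header cells
        products.append(
            {
                "Product ID": cell_text(row, "product-id"),
                "Name": cell_text(row, "product-name"),
                "Price": cell_text(row, "product-price"),
            }
        )
    return products


products = parse_products(HTML)
print(products)
```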
Dynamic Content and JavaScript Rendering
Some websites render tables dynamically with JavaScript, making them inaccessible to standard HTML parser libraries like BeautifulSoup and Cheerio. You'll need JavaScript-enabled tools, such as headless browsers, to access and scrape such tables, complicating the scraping process further.
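Before reaching for a headless browser, it can help to check whether the table is already present in the raw HTML. The sketch below shows one simple heuristic with BeautifulSoup. The function name `table_in_static_html` and both sample pages are hypothetical, and a real page may need a more careful check.

```python
from bs4 import BeautifulSoup


def table_in_static_html(html, table_id):
    """Return True if the table and at least one data cell exist in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    return table is not None and table.find("td") is not None


# hypothetical responses: one server-rendered, one filled in later by JavaScript
STATIC_PAGE = '<table id="product-catalog"><tr><td>001</td></tr></table>'
DYNAMIC_PAGE = '<div id="app"></div><script src="app.js"></script>'

static_result = table_in_static_html(STATIC_PAGE, "product-catalog")
dynamic_result = table_in_static_html(DYNAMIC_PAGE, "product-catalog")
print(static_result, dynamic_result)
```

If the check fails for a page that clearly shows a table in the browser, the table is probably rendered by JavaScript and needs a JavaScript-enabled tool.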
Best Python HTML Table Parsers
In this section, we've handpicked the best HTML parsers for extracting table data. If you want more generic HTML parsers, read our article on the best Python HTML parsers.
1. ZenRows
ZenRows is a web scraping API with all the tools required to scrape any website at scale without getting blocked. It features anti-bot and CAPTCHA auto-bypass, premium proxy auto-rotation, request header optimization, and more. ZenRows is compatible with any programming language and only requires a single API call. You can also access the ZenRows residential proxy service under the same price cap.
With an HTML table auto-parser, ZenRows is suited for parsing all table types, regardless of their structural complexity. ZenRows also has headless browser features, allowing you to scrape dynamically rendered tables easily.
👍 Pros
- Table auto-parsing feature for easy table parsing.
- It returns the table data in JSON format.
- A complete toolkit to bypass any anti-bot measure at scale.
- Headless browsing for scraping dynamically rendered table content.
- Flexible geo-targeting to access geo-restricted content.
- Support for XPath and CSS selectors.
- Full screenshot functionality.
- Beginner-friendly and easy to use.
- It requires less coding.
- Compatible with any programming language.
👎 Cons
- It's a paid service but offers a free trial.
2. BeautifulSoup (+lxml or html.parser)
BeautifulSoup is a popular Python library used to parse HTML and XML documents. It works with different parsers, such as lxml, html.parser, and html5lib, enabling you to traverse and manipulate the DOM tree easily. BeautifulSoup supports various selectors, including tag and CSS selectors. You can also pair it with lxml to use XPath selectors, making it highly suitable for tasks like parsing HTML tables.
While BeautifulSoup doesn't have a built-in HTTP client for fetching web pages, it's commonly used with libraries like Requests to handle HTTP requests. The downsides of BeautifulSoup are that it can't scrape dynamically rendered content and can't bypass anti-bot measures.
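As a quick illustration of the selector support, the sketch below uses BeautifulSoup's `select` and `select_one` methods with CSS selectors on a small table. The sample markup is hypothetical and only mimics the kind of class names a product table might use.

```python
from bs4 import BeautifulSoup

# hypothetical sample markup
HTML = """
<table id="product-catalog">
  <tbody>
    <tr><td class="product-name">Laptop</td><td class="product-price">$999.99</td></tr>
    <tr><td class="product-name">Smartphone</td><td class="product-price">$599.99</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() returns every element matching the CSS selector
names = [
    td.get_text(strip=True)
    for td in soup.select("#product-catalog td.product-name")
]

# select_one() returns only the first match
first_price = soup.select_one(
    "#product-catalog tr:first-child td.product-price"
).get_text(strip=True)

print(names)
print(first_price)
```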
👍 Pros
- Free HTML parser.
- Beginner-friendly.
- Lightweight.
- Support for various selectors.
- It pairs seamlessly with request clients.
- Extensive documentation.
👎 Cons
- No built-in HTTP clients.
- It cannot bypass anti-bot measures.
- It can't scrape JavaScript-rendered content.
- No auto-parsing feature.
3. Pandas
Pandas is a powerful data analysis and manipulation tool in Python. While not explicitly built for HTML parsing, it includes the read_html function, which leverages parsers like lxml to read and convert HTML tables into clean DataFrame formats. This method is convenient for quickly extracting tables from web pages and working with them in Pandas. However, it doesn't support CSS selectors or XPath for more granular data extraction, making it less suitable for parsing specific table elements.
While read_html can fetch and read HTML from a URL, it's not as robust as dedicated libraries like Requests, especially when the site requires custom headers or cookies. That said, you can combine Pandas with the Requests library and BeautifulSoup for more control over HTTP requests and fine-grained HTML table parsing.
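Here's a minimal sketch of `read_html` on an inline HTML string, assuming Pandas and a supported parser such as lxml are installed. The sample markup is hypothetical. The `attrs` argument narrows the search to a table with matching attributes, and `converters` keeps the ID column as text so leading zeros aren't lost.

```python
from io import StringIO

import pandas as pd

# hypothetical sample markup
HTML = """
<table id="product-catalog">
  <thead>
    <tr><th>Product ID</th><th>Name</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>001</td><td>Laptop</td><td>$999.99</td></tr>
    <tr><td>002</td><td>Smartphone</td><td>$599.99</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per matching table
tables = pd.read_html(
    StringIO(HTML),
    attrs={"id": "product-catalog"},
    converters={"Product ID": str},
)
df = tables[0]
print(df)
```

The HTML is wrapped in `StringIO` because recent Pandas versions deprecate passing literal HTML strings directly.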
👍 Pros
- Suitable for reading and parsing a whole table as a complete DataFrame.
- Faster table parsing.
- Built-in data cleaning and analysis tool.
- Works with Requests and BeautifulSoup.
👎 Cons
- It easily gets blocked by anti-bot measures.
- No robust HTTP client.
- It lacks support for CSS or XPath selectors.
- Unsuitable for parsing specific HTML table content.
- You can't use it to scrape dynamically rendered HTML tables.
How to Parse HTML Tables With Python
You've seen the best HTML table parsers. Now, it's time to get your hands on practical table data extraction. Using the Table Parsing Challenge page, you'll first learn to scrape table data using BeautifulSoup, a common HTML parser. Then, we'll move to ZenRows, a more straightforward and reliable solution.
Before we begin, note that the target page displays a product catalog table with five columns: Product ID, Name, Category, Price, and In Stock.
Let's scrape that table!
Parse an HTML Table Using BeautifulSoup
We'll use the Requests and BeautifulSoup libraries to scrape the table data on the target website. Install both packages using pip:
pip3 install requests beautifulsoup4
The next step is to inspect the table to view its elements and selectors. Open the target website on a browser like Chrome, right-click the table, and select Inspect.
You'll see that the table has an ID of product-catalog, with each td tag having a descriptive class name, such as product-id, product-name, and product-price.
To scrape the table on the target website, obtain the website's HTML with the Requests library and parse the response with BeautifulSoup. Get the table using its ID, iterate through its rows using a for loop, and extract each product's data. Here's the code to scrape the table data:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# request the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")

# parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# find the table by its ID
table = soup.find("table", id="product-catalog")

# initialize a list to store the data
product_data = []

# iterate over each row in the body of the table
for row in table.find("tbody").find_all("tr"):
    # extract each cell in the row
    product_id = row.find("td", class_="product-id").text
    name = row.find("td", class_="product-name").text
    category = row.find("td", class_="product-category").text
    price = row.find("td", class_="product-price").text
    stock = row.find("td", class_="product-stock").text

    # append the data to the list as a dictionary
    product_data.append(
        {
            "Product ID": product_id,
            "Name": name,
            "Category": category,
            "Price": price,
            "In Stock": stock,
        }
    )

# print the scraped data
print(product_data)
The above code scrapes the table data and appends it to a list, as shown:
[
    {
        'Product ID': '001',
        'Name': 'Laptop',
        'Category': 'Electronics',
        'Price': '$999.99',
        'In Stock': 'Yes',
    },
    {
        'Product ID': '002',
        'Name': 'Smartphone',
        'Category': 'Electronics',
        'Price': '$599.99',
        'In Stock': 'Yes',
    },
    # ... other products omitted for brevity,
    {
        'Product ID': '014',
        'Name': 'Air Purifier',
        'Category': 'Home',
        'Price': '$129.99',
        'In Stock': 'No',
    },
    {
        'Product ID': '015',
        'Name': 'Gaming Console',
        'Category': 'Electronics',
        'Price': '$399.99',
        'In Stock': 'Yes',
    },
]
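Every value in that list is a string. If you need numbers and booleans for analysis, a small post-processing step can convert them. The sketch below is one possible approach. The `clean_product` helper is hypothetical, and it assumes prices are formatted like `$999.99` and the stock column contains `Yes` or `No`, as in the output above.

```python
def clean_product(raw):
    """Convert the scraped strings of one product into typed values."""
    return {
        "Product ID": raw["Product ID"],  # keep as text to preserve leading zeros
        "Name": raw["Name"],
        "Category": raw["Category"],
        # strip the currency symbol and thousands separators before converting
        "Price": float(raw["Price"].replace("$", "").replace(",", "")),
        "In Stock": raw["In Stock"].strip().lower() == "yes",
    }


# two rows copied from the scraped output above
sample = [
    {
        "Product ID": "001",
        "Name": "Laptop",
        "Category": "Electronics",
        "Price": "$999.99",
        "In Stock": "Yes",
    },
    {
        "Product ID": "014",
        "Name": "Air Purifier",
        "Category": "Home",
        "Price": "$129.99",
        "In Stock": "No",
    },
]

cleaned = [clean_product(product) for product in sample]
print(cleaned)
```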
Cool! You've just scraped an HTML table with BeautifulSoup in Python. However, a simpler way to achieve this task with less coding is to use a web scraping API like ZenRows. You'll see how it works in the next section.
Parse an HTML Table Using ZenRows
Parsing tables with ZenRows is a straightforward process.
Sign up to open the Request Builder and enter the target URL in the link box. Activate Premium Proxies and JS Rendering. Select Python as your programming language, and choose the API connection mode. Copy and paste the generated code into your Python script and add the outputs:tables option to the existing request parameters.
Your code should look like the following after adding the outputs:tables parameter:
# pip install requests
import requests

url = "https://www.scrapingcourse.com/table-parsing"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
    "outputs": "tables",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
Run the above code. ZenRows will auto-parse the table and return a JSON response containing the required data, including the table's dimensions, headings, and individual row data. Here's the output:
{
    "tables": [
        {
            "Dimensions": {"Rows": 15, "Cols": 5, "Headings": true},
            "Headings": ["Product ID", "Name", "Category", "Price", "In Stock"],
            "Content": [
                {
                    "Category": "Electronics",
                    "In Stock": "Yes",
                    "Name": "Laptop",
                    "Price": "$999.99",
                    "Product ID": "001",
                },
                # ... other products omitted for brevity,
                {
                    "Category": "Electronics",
                    "In Stock": "Yes",
                    "Name": "Gaming Console",
                    "Price": "$399.99",
                    "Product ID": "015",
                },
            ],
        }
    ]
}
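If you want the rows in heading order rather than as dictionaries, you can load the response with the standard `json` module. The sketch below assumes the response body is valid JSON with the same `tables`, `Headings`, and `Content` keys shown above. The shortened `SAMPLE_RESPONSE` string and the `rows_from_response` helper are hypothetical stand-ins for a real API response.

```python
import json

# hypothetical, shortened stand-in for the API response body
SAMPLE_RESPONSE = """
{
  "tables": [
    {
      "Dimensions": {"Rows": 2, "Cols": 5, "Headings": true},
      "Headings": ["Product ID", "Name", "Category", "Price", "In Stock"],
      "Content": [
        {"Category": "Electronics", "In Stock": "Yes", "Name": "Laptop",
         "Price": "$999.99", "Product ID": "001"},
        {"Category": "Electronics", "In Stock": "Yes", "Name": "Gaming Console",
         "Price": "$399.99", "Product ID": "015"}
      ]
    }
  ]
}
"""


def rows_from_response(body):
    """Return each table row as a list ordered by the table's headings."""
    data = json.loads(body)
    rows = []
    for table in data.get("tables", []):
        headings = table.get("Headings", [])
        for item in table.get("Content", []):
            rows.append([item.get(heading) for heading in headings])
    return rows


rows = rows_from_response(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```

With a live request, you would pass `response.text` to the helper instead of the sample string.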
Congratulations! You've used ZenRows to scrape a complete HTML table without worrying about tedious element inspection or the scraping logic.
Conclusion
You've seen the 3 top table parsing tools in Python and learned how to extract an HTML table using 2 of those tools. Well done!
Complex table layouts, inconsistent HTML structure, and dynamic table rendering are significant challenges when extracting table data. The free tools mentioned above (BeautifulSoup and Pandas) don't support JavaScript rendering and may be inefficient for parsing complex tables. They also can't bypass anti-bot protection.
The easiest way to extract any HTML table, regardless of its layout complexity or rendering style, is to use ZenRows. It handles table auto-parsing, anti-bot auto-bypass, JavaScript rendering, and more under the hood, allowing you to scrape at scale without limitations.
Try ZenRows for free now without a credit card!