How to Parse Tables Using BeautifulSoup (+2 better ways)

Idowu Omisola
September 27, 2024 · 4 min read

Extracting data from tables is one of the most challenging parts of web scraping. The task becomes even more difficult when you're dealing with complex markup, dynamic website designs, or obfuscated HTML.

Having easy-to-understand steps at your disposal is crucial.

In this guide, we'll walk you through the steps to parse tables using BeautifulSoup and then explore two even better ways to make the task easier.

How to Parse Tables With BeautifulSoup 

BeautifulSoup is a powerful Python parsing library, and here's why.

It creates a DOM tree that lets you quickly navigate and manipulate HTML documents. It also provides a high-level API that sits atop three different parsers (lxml, html5lib, and html.parser), allowing you to choose according to your project needs.
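
For instance, switching parsers is just a matter of changing the second argument to the BeautifulSoup constructor. Here's a minimal sketch using a throwaway HTML snippet:

Example
# a quick illustration of parser selection (the HTML string is just a sample)
from bs4 import BeautifulSoup

html = '<table><tr><th>Name</th></tr><tr><td>Laptop</td></tr></table>'

soup = BeautifulSoup(html, 'lxml')  # fast, C-based parser
# soup = BeautifulSoup(html, 'html.parser')  # pure Python, no extra install
# soup = BeautifulSoup(html, 'html5lib')  # most lenient, browser-like parsing

print(soup.td.text)  # Laptop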

To parse tables using BeautifulSoup, you need to locate the table element, then iterate through each row and extract the data from the cells.

Below is a step-by-step guide. 

Prerequisite

Before you start writing the code, you must install the required libraries.

First, we'll install the Python Requests package, which allows you to fetch the webpage. You can use any HTTP client of your choice.

Once you retrieve the page, you need to extract data from the target table. This is where BeautifulSoup and lxml come in.

We'll use the lxml parser because it's much faster than the others.

Navigate to your terminal and enter the following command to install Requests, BeautifulSoup, and lxml.

Terminal
pip3 install requests beautifulsoup4 lxml

You're all set.

Step 1: Fetch Target Webpage

Let's take a look at the table we'll be scraping. We'll use ScrapingCourse's Table Parsing Challenge page as the target website for this demonstration.

Table Parsing Challenge

To fetch this page, import the necessary libraries. Then, using Requests, make a GET request to the target server and retrieve the response.

Example
# import the necessary libraries
from bs4 import BeautifulSoup
import requests 

# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text
print(html)

Step 2: Parse HTML Tables

Now that you have the raw HTML file, it's time to parse tables using BeautifulSoup.

Start by creating a BeautifulSoup object.

Example
# ...

# create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

This converts the HTML document into a parse tree, which allows you to interact with elements in a parent-child structure.
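
For instance, you can move through that tree with simple attribute access (a quick illustration; the exact parent tag depends on the page's markup):

Example
# ...

# attribute access returns the first matching element in the tree
print(soup.title.text)

# each element exposes its parent node
print(soup.find('table').parent.name)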

After that, locate the table element.

Example
# ...

# select the table element
table = soup.find('table')

BeautifulSoup provides easy-to-use methods like find() and find_all() that let you select HTML elements by tag name and attributes, while select() and select_one() accept CSS selectors.

In this case (a demo website with only one table), we use the find() method to select the target table using the <table> tag.

However, you should inspect the page via the DevTools to identify the right selectors for more complex cases.
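
For instance, if a page contained several tables, you could narrow the match with attributes or a CSS selector. The id and class below are hypothetical; use whatever you find in DevTools:

Example
# ...

# match by attribute (hypothetical id)
table = soup.find('table', id='product-table')

# or use a CSS selector via select_one() (hypothetical class)
table = soup.select_one('table.product-catalog')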

Lastly, select the table rows and iterate through each one to extract their text content.

Example
# ...

# extract headers 
headers = [th.text.strip() for th in table.find_all('th')]

# extract table body
rows = []
for row in table.find_all('tr')[1:]:  
    cells = [td.text.strip() for td in row.find_all('td')]
    rows.append(cells)

# log data
print("Headers:", headers)
print("table_body:")
for row in rows:
    print(row)

For simplicity, we've separated the headers from the table body.

The headers are the <th> elements nested in the first row of the table (<tr>). We select all <th> elements and loop through them, extracting their text content.

Similarly, we select all <td> tags (cells) nested in the subsequent <tr> tags and loop through each one to extract their text content.

That's it.

Now, put all the steps together to get the complete code.

Example
# import the necessary libraries
from bs4 import BeautifulSoup
import requests 

# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text

# create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# select the table element
table = soup.find('table')

# extract headers 
headers = [th.text.strip() for th in table.find_all('th')]

# extract table body
rows = []
for row in table.find_all('tr')[1:]:  
    cells = [td.text.strip() for td in row.find_all('td')]
    rows.append(cells)

# log data
print("Headers:", headers)
print("table_body:")
for row in rows:
    print(row)

This code prints the headers and data in subsequent rows.

Output
Headers: ['Product ID', 'Name', 'Category', 'Price', 'In Stock']
table_body:
['001', 'Laptop', 'Electronics', '$999.99', 'Yes']
['002', 'Smartphone', 'Electronics', '$599.99', 'Yes']
['003', 'Headphones', 'Audio', '$149.99', 'No']
['004', 'Coffee Maker', 'Appliances', '$79.99', 'Yes']
# ... truncated for brevity
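
As an optional extra step, you can zip the headers with each row to get dictionaries, which are often handier for downstream processing:

Example
# ...

# pair each cell with its column header
records = [dict(zip(headers, row)) for row in rows]
print(records[0])
# {'Product ID': '001', 'Name': 'Laptop', 'Category': 'Electronics', 'Price': '$999.99', 'In Stock': 'Yes'}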

Congratulations! You now know how to parse tables using BeautifulSoup.

However, while the example above is pretty straightforward, parsing with BeautifulSoup can get complex, depending on your use case.

Modern websites often use dynamic designs that can result in page layout changes and class name changes. Not to mention the lack of "meaningful" selectors in real-world websites or the obfuscation and WAFs employed to deter web scraping.
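
One partial mitigation is to anchor your parsing on visible text rather than class names, which change more often. Here's a rough sketch of that idea, reusing the header text from our demo table:

Example
# ...

# locate the target table by a header it must contain, not by class name
target = None
for candidate in soup.find_all('table'):
    header_texts = [th.text.strip() for th in candidate.find_all('th')]
    if 'Product ID' in header_texts:
        target = candidate
        break

Heuristics like this only go so far, though; they can't get you past obfuscation or an anti-bot wall.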

To handle such scenarios, let's explore two more efficient alternatives.

Pro Alternative #1: Table Parsing Using Pandas

Pandas is an open-source data analysis and manipulation tool. It provides a read_html() function that parses tables directly into DataFrames.

DataFrames are table-like structures, which makes them more intuitive to work with than the text content you extract manually with BeautifulSoup.

To parse tables using Pandas, load the HTML document into a list of DataFrames using the read_html() function.

Here's a step-by-step guide.

Prerequisite

Install the Pandas library using the following command.

Terminal
pip3 install pandas

Parse HTML Tables

Import the necessary libraries and parse the HTML document using the read_html() function.

Example
# import the required libraries
import pandas as pd
from io import StringIO
import requests

# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text

# parse HTML 
dataframes = pd.read_html(StringIO(html))
print(dataframes)

This code returns a list of DataFrames, which you can access by indexing. Each DataFrame corresponds to one table on the target webpage. Since this example page contains only one table, the list holds a single DataFrame.

Output
[    Product ID                 Name       Category    Price In Stock
0            1               Laptop    Electronics  $999.99      Yes
1            2           Smartphone    Electronics  $599.99      Yes
2            3           Headphones          Audio  $149.99       No
3            4         Coffee Maker     Appliances   $79.99      Yes
4            5        Running Shoes         Sports   $89.99      Yes
5            6          Smart Watch    Electronics  $249.99      Yes
...
]
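
From here, the usual Pandas workflow applies. For instance, you can grab the first DataFrame and export it (the products.csv filename is just an example):

Example
# ...

# grab the only table on the page
df = dataframes[0]

# inspect the first few rows
print(df.head())

# export the parsed table to CSV
df.to_csv('products.csv', index=False)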

Well done! 

Pro Alternative #2: Parsing Tables With ZenRows

Unlike BeautifulSoup, ZenRows is much more than a parsing tool. It is a web scraping API that simplifies data extraction in complex scenarios. ZenRows automatically handles class name changes, obfuscation, and most potential challenges, allowing you to focus on extracting your desired data.

It returns the table data in JSON format and sections the output into dimensions, headings, and content. This can be useful for many reasons, including quick processing and transmitting data via APIs.

ZenRows also requires less code than open-source tools like BeautifulSoup. A single request parameter, outputs, is all it takes to parse tables with ZenRows.

Here's a step-by-step guide using the same target webpage as before.

Prerequisite

To use ZenRows, you need an API key. Sign up for free to get yours.

You'll be redirected to the Request Builder page, where you'll find your ZenRows API key at the top right.

Building a scraper with ZenRows

Parse HTML Tables

Input the target URL (https://www.scrapingcourse.com/table-parsing). Then activate Premium Proxies and JS Rendering to handle advanced anti-bot systems.

That'll generate your request code on the right. Copy it to your code editor.

After that, set the outputs parameter to tables, and ZenRows will automatically parse the tables on the page, returning dimensions, headings, and content.

Your complete code should look like this:

Example
# import required library
import requests

url = 'https://www.scrapingcourse.com/table-parsing'
apikey = '<YOUR_ZENROWS_API_KEY>'
# set the necessary parameters
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
    'outputs': 'tables'
}
# make GET request to target page and retrieve response
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

Here's the result:

Output
{
  "Dimensions": {
    "rows": 15,
    "columns": 5,
    "heading": true
  },
  "Headings": :["Product ID","Name","Category","Price","In Stock"],
  "content": [
    {"Category":"Electronics","In Stock":"Yes","Name":"Laptop","Price":"$999.99","Product ID":"001"},
    {"Category":"Electronics","In Stock":"Yes","Name":"Smartphone","Price":"$599.99","Product ID":"002"},
    {"Category":"Audio","In Stock":"No","Name":"Headphones","Price":"$149.99","Product ID":"003"},
  ...
  ]
}
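
Because the response is JSON, you can load it with response.json() and work with the parsed rows directly. The field names below come from the output above:

Example
# ...

# parse the JSON response into Python objects
data = response.json()

# each entry in "content" is one table row keyed by its headers
for row in data['content']:
    print(row['Name'], row['Price'])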

Awesome, right? That's how easy it is to parse tables using ZenRows.

To try ZenRows for free, sign up now.

Conclusion

Parsing tables can be one of the toughest web scraping challenges, especially when dealing with complex web layouts and advanced anti-bot systems.

While open-source tools like BeautifulSoup and Pandas have their use cases, the ZenRows web scraping API is the most robust option: it's resilient to layout changes and handles advanced anti-bot systems for you.

This makes ZenRows an excellent tool for parsing data in real-world scenarios where complex HTML and anti-bot systems dominate.
Try ZenRows for free now.
