Extracting data from tables is one of the more challenging parts of web scraping. It becomes even harder when you're dealing with complex markup, dynamic website designs, or obfuscated HTML.
Having easy-to-understand steps at your disposal is crucial.
In this guide, we'll walk you through the steps to parse tables using BeautifulSoup and then explore two even better ways to make the task easier.
How to Parse Tables With BeautifulSoup
BeautifulSoup is a powerful Python parsing library, and here's why.
It creates a DOM tree that lets you quickly navigate and manipulate HTML documents. It also provides a high-level API that sits atop three different parsers (lxml, html5lib, and html.parser), allowing you to choose according to your project needs.
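For illustration, here's a minimal sketch of how the parser choice is made when you construct the object (lxml and html5lib are installed separately; html.parser ships with Python):
# a minimal sketch of picking a parser when building the BeautifulSoup object
from bs4 import BeautifulSoup
html = "<table><tr><td>cell</td></tr></table>"
soup_fast = BeautifulSoup(html, "lxml")  # fastest, lenient with broken HTML
soup_browser = BeautifulSoup(html, "html5lib")  # parses the way a browser does
soup_builtin = BeautifulSoup(html, "html.parser")  # no extra dependency
print(soup_fast.td.text)  # cell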
To parse tables using BeautifulSoup, you need to locate the table element, then iterate through each row and extract the data from the cells.
Below is a step-by-step guide.
Prerequisite
Before you start writing the code, you must install the required libraries.
First, we'll install the Python Requests package, which allows you to fetch the webpage. You can use any HTTP client of your choice.
Once you retrieve the page, you need to extract data from the target table. This is where BeautifulSoup and lxml come in.
We'll use the lxml parser because it's much faster than the others.
Navigate to your terminal and enter the following commands to install Requests, BeautifulSoup, and lxml.
pip3 install requests beautifulsoup4 lxml
You're all set.
Step 1: Fetch Target Webpage
Let's take a look at the table we'll be scraping. We'll use the ScrapingCourse's Table Parsing Challenge as the target website for this demonstration.
To fetch this page, import the necessary libraries. Then, using Requests, make a GET request to the target server and retrieve the response.
# import the necessary libraries
from bs4 import BeautifulSoup
import requests
# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text
print(html)
Step 2: Parse HTML Tables
Now that you have the raw HTML file, it's time to parse tables using BeautifulSoup.
Start by creating a BeautifulSoup object.
# ...
# create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
This converts the HTML document into a parse tree, which allows you to interact with elements in a parent-child structure.
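For example, once the parse tree exists, you can move between parents and children directly. Here's a minimal sketch (the exact parent tag depends on the page's markup):
# ...
# navigate the parse tree in a parent-child fashion
first_row = soup.find('tr')  # first table row in the document
print(first_row.parent.name)  # its parent element, e.g. "thead" or "table"
print([cell.name for cell in first_row.find_all(['th', 'td'])])  # the cell elements it contains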
After that, locate the table element.
# ...
# select the table element
table = soup.find('table')
BeautifulSoup provides easy-to-use methods like find() and find_all(), which let you select HTML elements by different criteria such as tag names and attributes, plus select() and select_one() for CSS selectors.
In this case (a demo website with only one table), we use the find() method to select the target table by its <table> tag.
However, you should inspect the page via the DevTools to identify the right selectors for more complex cases.
While using the selectors provided by the DevTools may work for testing purposes, they're sometimes less meaningful and can easily break. Thus, it's best to identify your selectors manually.
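For instance, if a page contained several tables, you could pin the selection to an attribute instead of the bare tag. The class and id values below are hypothetical and only illustrate the idea:
# hypothetical selectors; the class and id values are examples, not attributes of the demo page
table_by_class = soup.find('table', class_='product-table')
table_by_id = soup.find('table', id='product-catalog')
table_by_css = soup.select_one('table#product-catalog')  # the same selection via a CSS selector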
Lastly, select the table rows and iterate through each one to extract their text content.
# ...
# extract headers
headers = [th.text.strip() for th in table.find_all('th')]
# extract table body
rows = []
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    rows.append(cells)
# log data
print("Headers:", headers)
print("table_body:")
for row in rows:
    print(row)
For simplicity, we've separated the headers from the table body.
The headers are the <th> elements nested in the first row of the table (<tr>). We select all <th> elements and loop through them, extracting their text content.
Similarly, we select all the remaining <tr> elements (skipping the header row), grab the <td> cells in each one, and collect their text content into the rows list.
That's it.
Now, put all the steps together to get the complete code.
# import the necessary libraries
from bs4 import BeautifulSoup
import requests
# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text
# create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# select the table element
table = soup.find('table')
# extract headers
headers = [th.text.strip() for th in table.find_all('th')]
# extract table body
rows = []
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    rows.append(cells)
# log data
print("Headers:", headers)
print("table_body:")
for row in rows:
    print(row)
This code prints the headers and data in subsequent rows.
Headers: ['Product ID', 'Name', 'Category', 'Price', 'In Stock']
table_body:
['001', 'Laptop', 'Electronics', '$999.99', 'Yes']
['002', 'Smartphone', 'Electronics', '$599.99', 'Yes']
['003', 'Headphones', 'Audio', '$149.99', 'No']
['004', 'Coffee Maker', 'Appliances', '$79.99', 'Yes']
# ... truncated for brevity
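If you want to persist the extracted data, you can write it to a CSV file with Python's built-in csv module. Here's a minimal sketch that assumes the headers and rows variables from the complete script above:
import csv
# write the parsed table to a CSV file
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(headers)  # header row
    writer.writerows(rows)  # one line per table row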
Congratulations! You now know how to parse tables using BeautifulSoup.
However, while the example above is pretty straightforward, parsing with BeautifulSoup can get complex, depending on your use case.
Modern websites often use dynamic designs that lead to frequent layout and class name changes, not to mention the lack of "meaningful" selectors on real-world pages and the obfuscation and WAFs employed to deter web scraping.
To handle such scenarios, let's explore two more efficient alternatives.
Pro Alternative #1: Table Parsing Using Pandas
Pandas is an open-source data analysis and manipulation tool. It provides a read_html() function that parses tables directly into DataFrames.
DataFrames are similar to table structures and are more intuitive than manually extracting text content when using BeautifulSoup.
To parse tables using Pandas, load the HTML document into a list of DataFrames using the read_html() function.
Here's a step-by-step guide.
Prerequisite
Install the Pandas library using the following command.
pip3 install pandas
Parse HTML Tables
Import the necessary libraries and parse the HTML document using the read_html() function.
Note that passing literal HTML into read_html() is deprecated, so you need to wrap it in a StringIO object and add an import for StringIO.
# import the required libraries
import pandas as pd
from io import StringIO
import requests
# make a GET request to the target website
response = requests.get("https://www.scrapingcourse.com/table-parsing")
# retrieve the response
html = response.text
# parse HTML
dataframes = pd.read_html(StringIO(html))
print(dataframes)
This code will return a list of DataFrames, which you can access by indexing. Each DataFrame corresponds to the tables on the target webpage. Since we have only one table in this example, this list contains only one DataFrame.
[ Product ID Name Category Price In Stock
0 1 Laptop Electronics $999.99 Yes
1 2 Smartphone Electronics $599.99 Yes
2 3 Headphones Audio $149.99 No
3 4 Coffee Maker Appliances $79.99 Yes
4 5 Running Shoes Sports $89.99 Yes
5 6 Smart Watch Electronics $249.99 Yes
...
]
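From here, grab the DataFrame from the list by index and work with it directly. Here's a minimal sketch that filters and exports the data (the column name is taken from the output above):
# access the first (and only) table on the page
df = dataframes[0]
# example: keep only the in-stock products
in_stock = df[df["In Stock"] == "Yes"]
# export the filtered table to CSV without the index column
in_stock.to_csv("products_in_stock.csv", index=False)
print(in_stock.head())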
Well done!
Pro Alternative #2: Parsing Tables With ZenRows
Unlike BeautifulSoup, ZenRows is much more than a parsing tool. It is a web scraping API that simplifies data extraction in complex scenarios. ZenRows automatically handles class name changes, obfuscation, and most potential challenges, allowing you to focus on extracting your desired data.
It returns the table data in JSON format and sections the output into dimensions, headings, and content. This can be useful for many reasons, including quick processing and transmitting data via APIs.
ZenRows requires less code compared to open-source tools like BeautifulSoup. You only need a single line of code, the outputs parameter, to parse tables with ZenRows.
Here's a step-by-step guide using the same target webpage as before.
Prerequisite
To use ZenRows, you need an API key. Sign up for free to get yours.
You'll be redirected to the Request Builder page, where you'll find your ZenRows API key at the top right.
Parse HTML Tables
Input the target URL (https://www.scrapingcourse.com/table-parsing). Then, activate Premium Proxies and JS Rendering to handle advanced anti-bot systems.
That'll generate your request code on the right. Copy it to your code editor.
After that, set the outputs parameter to tables, and ZenRows will automatically parse the tables on the page, returning dimensions, headings, and content.
Your complete code should look like this:
# import required library
import requests
url = 'https://www.scrapingcourse.com/table-parsing'
apikey = '<YOUR_ZENROWS_API_KEY>'
# set the necessary parameters
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
    'outputs': 'tables'
}
# make GET request to target page and retrieve response
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Here's the result:
{
    "Dimensions": {
        "rows": 15,
        "columns": 5,
        "heading": true
    },
    "Headings": ["Product ID","Name","Category","Price","In Stock"],
    "content": [
        {"Category":"Electronics","In Stock":"Yes","Name":"Laptop","Price":"$999.99","Product ID":"001"},
        {"Category":"Electronics","In Stock":"Yes","Name":"Smartphone","Price":"$599.99","Product ID":"002"},
        {"Category":"Audio","In Stock":"No","Name":"Headphones","Price":"$149.99","Product ID":"003"},
        ...
    ]
}
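Since the response is already structured JSON, you can feed it straight into the rest of your pipeline. Here's a minimal sketch that parses the response and loads the content array into a Pandas DataFrame, assuming the structure shown above:
import pandas as pd
# parse the JSON response (structure as in the sample output above)
data = response.json()
# load the row objects into a DataFrame for further processing
df = pd.DataFrame(data["content"])
print(df.head())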
Awesome, right? That's how easy it is to parse tables using ZenRows.
To try ZenRows for free, sign up now.
Conclusion
Parsing tables can be one of the toughest web scraping challenges, especially when dealing with complex web layouts and advanced anti-bot systems.
While open-source tools like BeautifulSoup and Pandas have their use cases, the ZenRows web scraping API is the best approach for all cases. It's resilient to layout changes and can bypass any anti-bot system.
This makes ZenRows an excellent tool for parsing data in real-world scenarios where complex HTML and anti-bot systems dominate.
Try ZenRows for free now.