How to Parse HTML With Python (Using The Top 6 Parsers)

Sergio Nonide
October 7, 2024 · 7 min read

Parsing HTML can be challenging, especially when dealing with broken tags, inconsistent attributes, or deeply nested elements. Luckily, scraping with Python is made easier by various tools designed to extract the data you need.

In this article, we'll walk you through six of the best Python HTML parsers, ranging from beginner-friendly to more advanced options, with quick examples using the ScrapingCourse e-commerce demo page to show how they work.

Let's get right to it!

1. BeautifulSoup

BeautifulSoup is a beginner-friendly Python library used to parse HTML and XML documents. It creates a parse tree from the page source code, allowing you to easily navigate through parent and child elements. It works with underlying parsers such as lxml, html.parser, or html5lib to traverse the parse tree and extract data.

BeautifulSoup supports various selectors, including CSS selectors and tag selectors. It can also be combined with lxml to use XPath selectors for more complex or deeply nested elements. 
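
For instance, here's a minimal sketch (using a made-up HTML snippet rather than the demo page) contrasting a tag-based selector with a CSS selector:

Example
from bs4 import BeautifulSoup

# a made-up snippet for illustration
html = '<div class="card"><h2 class="title">Example Product</h2></div>'
soup = BeautifulSoup(html, 'html.parser')

# tag-based selector
print(soup.find('h2').text)              # Example Product

# CSS selector
print(soup.select_one('h2.title').text)  # Example Product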

Since BeautifulSoup doesn't have a built-in HTTP client, it's commonly used with libraries like Requests or urllib to fetch web pages. It is also backed by an active community that provides comprehensive documentation and frequent updates.

Pros of BeautifulSoup

  • Easy to learn and use, especially for beginners.
  • Compatible with multiple parsers like lxml and html.parser.
  • Supports both tag-based and CSS selectors.
  • Actively maintained with strong community support.

Cons of BeautifulSoup

  • No support for rendering dynamic, JavaScript-generated content.
  • Slower compared to other parsers like lxml when handling larger documents.
  • Cannot bypass anti-bot measures on websites with strict protections.

How to parse HTML with BeautifulSoup

From the target website e-commerce demo page, let's extract the product name, price, and image URL.

To get started, install BeautifulSoup and Requests using pip3:

Terminal
pip3 install beautifulsoup4 requests

Right-click on the first product and inspect it. You’ll find the CSS selectors for the product details (product name, price, and image URL) you want to extract.

[Image: inspecting the first product element on the ScrapingCourse e-commerce homepage]

Fetch the webpage content using the Requests library. Then, use BeautifulSoup's find_all method to locate the product details by their class names, and loop through the elements to extract the product names, prices, and image URLs:

Example
import requests
from bs4 import BeautifulSoup

# fetch the page content
url = 'https://www.scrapingcourse.com/ecommerce/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# extract product names, prices, and image URLs
product_names = soup.find_all('h2', class_='woocommerce-loop-product__title')
product_prices = soup.find_all('span', class_='woocommerce-Price-amount')
product_images = soup.find_all('img', class_='attachment-woocommerce_thumbnail')

# loop through the products and print the details
for name, price, img in zip(product_names, product_prices, product_images):
    print(f"Product: {name.text.strip()}")
    print(f"Price: {price.text.strip()}")
    print(f"Image URL: {img['src']}")
    print("-" * 35)

The above code prints each product's name, price, and image URL found on the page:

Output
Product: Abominable Hoodie
Price: $69.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
-----------------------------------

# ... other products omitted for brevity

Product: Artemis Running Short
Price: $45.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg
-----------------------------------

Great job! You've just parsed HTML with BeautifulSoup. Now, let's see how to use other tools.

2. ZenRows

The ZenRows web scraping API is designed to extract data from websites without getting blocked. It can load JavaScript-rendered content, making it perfect for websites with dynamic elements. It also supports popular parsing methods like CSS selectors and XPath for precise data extraction.

ZenRows is well-equipped to handle poorly structured HTML and websites that frequently change their layout. Its intelligent parsers adapt to different data formats, ensuring you can extract the information you need regardless of how it's structured. 

ZenRows uses anti-bot bypass measures like CAPTCHA auto-bypass, premium proxy auto-rotation, request header optimization, and more to ensure you don't get blocked when scraping data.

It also uses residential proxies to access geo-restricted websites and retrieve location-specific content. The ZenRows team actively maintains the API, providing regular updates and support to guarantee optimal performance for your web scraping needs.

Pros of ZenRows

  • Bypass CAPTCHAs, IP bans, and other anti-bot measures.
  • Headless browsing allows for scraping dynamic content without rendering issues.
  • Flexible geo-targeting to access geo-restricted websites.
  • Supports both XPath and CSS selectors for precise data extraction.
  • Beginner-friendly with less coding required, making it easy to get started.
  • Compatible with any programming language.
  • Uses residential proxies to route your internet traffic through real residential IP addresses and avoid detection.

Cons of ZenRows

  • It's a paid service.

How to Parse HTML with ZenRows

Sign up and access the Request Builder. Enter the target URL into the link box and activate Premium Proxies and JS Rendering.  

Since we're parsing product details, click on CSS Selectors. In this tab, you'll need to provide the CSS selectors for the product data in JSON format, as shown below:

Example
{
    "price": "span.woocommerce-Price-amount",
    "product-name": "h2.woocommerce-loop-product__title",
    "images": "img @src"
}

Choose Python as your programming language and select the API connection mode. Copy and paste the generated code into your Python file:

[Image: building a scraper with the ZenRows Request Builder]

The generated code should look like this:

Example
# pip3 install requests
import requests

# set the target URL
url = "https://www.scrapingcourse.com/ecommerce/"
apikey = "<YOUR_ZENROWS_API_KEY>"

# parameters for ZenRows
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
    "css_extractor": """{
        "price":"span.woocommerce-Price-amount",
        "product-name":"h2.woocommerce-loop-product__title",
        "images":"img @src"
    }""",
}

# fetch the data
response = requests.get("https://api.zenrows.com/v1/", params=params)

# output the response data
print(response.text)

Running this script will output the product details in JSON format:

Output
{
  "images": [
    "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg",
    "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg",
    "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp07-black_main.jpg",
    "..."
  ],
  "price": [
    "$0.00",
    "$69.00",
    "$57.00",,
    "..."
  ],
  "product-name": [
    "Abominable Hoodie",
    "Adrienne Trek Jacket",
    "Aeon Capri",
    "..."
  ]
}

You have successfully parsed HTML using ZenRows. Good job!

3. lxml

lxml combines the ease of Python with the speed of C, making it a powerful and fast library for parsing both XML and HTML documents. It's significantly faster than many other Python-based parsers and offers better performance for large documents.

lxml supports both XPath and XSLT, offering flexibility for tasks that require querying and transforming documents.
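
As a quick sketch of the XSLT side (with made-up XML input), the following stylesheet flattens a document into plain text:

Example
from lxml import etree

# made-up XML input for illustration
xml = etree.fromstring('<items><item>A</item><item>B</item></items>')

# an XSLT stylesheet that prints each <item> as plain text
xslt = etree.fromstring('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//item"><xsl:value-of select="."/> </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
''')

transform = etree.XSLT(xslt)
print(str(transform(xml)))  # A B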

When dealing with broken or poorly formatted HTML, lxml's built-in tree correction automatically fixes these issues, resulting in more accurate and cleaner parsing. It also provides advanced error handling and validation for XML documents, ensuring strict data formats are processed correctly. 
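
For example, this short snippet (with deliberately broken markup) shows lxml repairing unclosed tags on the fly:

Example
from lxml import html

# unclosed <li> tags and no closing </ul> -- lxml fixes the tree automatically
broken = '<ul><li>one<li>two'
tree = html.fromstring(broken)
print(html.tostring(tree))  # b'<ul><li>one</li><li>two</li></ul>'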

lxml can be combined with other parsers like BeautifulSoup for better flexibility when handling complex parsing tasks. The lxml library is actively maintained with regular updates and long-term support from a dedicated community, which guarantees its continued reliability and performance.

Pros of lxml

  • Fast and lightweight, especially for large or complex documents.
  • Full support for XPath and XSLT.
  • Flexible integration with BeautifulSoup for enhanced parsing capabilities.
  • Actively maintained with continuous updates.

Cons of lxml

  • Steeper learning curve than simpler parsers.
  • Depends on C libraries (libxml2 and libxslt), which can make installation more complicated than pure-Python parsers.
  • Does not support JavaScript-rendered content.

How to Parse HTML with lxml

In your terminal, enter the following command to install lxml and Requests:

Terminal
pip3 install lxml requests

Then, use Requests to fetch the HTML content from the target website and lxml's XPath selectors to extract each product's name, price, and image URL:

Example
import requests
from lxml import html

# fetch page content
url = 'https://www.scrapingcourse.com/ecommerce/'
response = requests.get(url)
tree = html.fromstring(response.content)

# extract product names, prices, and image urls
product_names = tree.xpath('//h2[contains(@class, "woocommerce-loop-product__title")]/text()')
product_prices = tree.xpath('//span[@class="product-price woocommerce-Price-amount amount"]/bdi/text()')
product_images = tree.xpath('//img[contains(@class, "attachment-woocommerce_thumbnail")]/@src')

# loop through the products and print the details
for name, price, img in zip(product_names, product_prices, product_images):
    print(f"Product: {name}")
    print(f"Price: {price}")
    print(f"Image URL: {img}")
    print("-" * 35)

The code will give the same output as shown in the BeautifulSoup section. Let's move on to the next parser.

4. html5lib

html5lib is a pure Python library that follows the HTML5 specification, making it ideal for parsing poorly structured HTML that other parsers may struggle with. It builds the same DOM tree as a web browser, ensuring that the parsed HTML mirrors how the page would render in a browser.

While slower than other parsers like lxml, html5lib is a solid choice when browser-level accuracy is needed. 
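
Here's a minimal sketch of that browser-level accuracy, using a deliberately malformed snippet where the closing </p> tags are implied:

Example
import html5lib

# malformed on purpose: the </p> tags are implied, not written
doc = html5lib.parse('<p>first<p>second', namespaceHTMLElements=False)

# html5lib inserts the missing closing tags the same way a browser would
for p in doc.findall('.//p'):
    print(p.tag, '->', p.text)
# p -> first
# p -> second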

It supports tag-based, CSS selector-based, and XPath-based extraction methods, offering flexibility in how users extract data from web pages. It can also be paired with BeautifulSoup for easier navigation through HTML documents. 

html5lib has a smaller community and limited documentation, which makes it challenging for beginners to learn and adopt.

Pros of html5lib

  • Follows the HTML5 specification precisely, making it highly accurate for complex or broken HTML.
  • Can handle poorly structured HTML better than most parsers.
  • Works well with other parsers like BeautifulSoup for additional flexibility.
  • Builds the same document tree as modern browsers do.

Cons of html5lib

  • Slower than lxml or BeautifulSoup.
  • Limited documentation and community.
  • No JavaScript support.

How to Parse HTML with html5lib

Start by entering the following command in your terminal to install BeautifulSoup, html5lib, and Requests:

Terminal
pip3 install html5lib beautifulsoup4 requests

Then, fetch the page content with Requests, parse it with html5lib, and use BeautifulSoup to extract the product names, prices, and image URLs from the e-commerce demo page:

Example
import requests
import html5lib
from html5lib.serializer import serialize
from bs4 import BeautifulSoup

# set the URL for the e-commerce page
url = 'https://www.scrapingcourse.com/ecommerce/'

# fetch the page content
response = requests.get(url)
html_content = response.text

# parse the HTML content with html5lib
doc = html5lib.parse(html_content)

# serialize the HTML content to make it accessible for BeautifulSoup
serialized_html = serialize(doc)
soup = BeautifulSoup(serialized_html, 'html5lib')

# extract product names, prices, and image URLs
product_names = soup.find_all('h2', {'class': 'woocommerce-loop-product__title'})
product_prices = soup.find_all('span', {'class': 'woocommerce-Price-amount'})
product_images = soup.find_all('img', {'class': 'attachment-woocommerce_thumbnail'})

# loop through the products and print the details
for name, price, img in zip(product_names, product_prices, product_images):
    print(f"Product: {name.text.strip()}")
    print(f"Price: {price.text.strip()}")
    print(f"Image URL: {img['src']}")
    print("-" * 35)

The code will give the same output as the previous sections. Moving on to the next parser.

5. HTMLParser (Standard Library)

HTMLParser is an HTML parsing class provided by Python's standard library in the html.parser module. It does not require any external dependencies, making it a lightweight and easy-to-use option for basic HTML parsing tasks.

While it's not as powerful or feature-rich as other parsers like lxml or BeautifulSoup, it's a good choice for simple projects where you need to parse straightforward HTML content without installing extra packages.

Unlike other parsers, HTMLParser does not create a DOM tree. Instead, it processes HTML content as a stream, calling specific methods as it encounters start tags, end tags, data, or attributes. This provides control but requires more effort to extract structured data.
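
Here's a minimal sketch of that event-driven model, logging each event for a made-up snippet:

Example
from html.parser import HTMLParser

# a tiny subclass that just logs every parsing event as the stream goes by
class ShowEvents(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('start:', tag, attrs)

    def handle_data(self, data):
        if data.strip():
            print('data: ', data.strip())

    def handle_endtag(self, tag):
        print('end:  ', tag)

ShowEvents().feed('<p>Hello <b>world</b></p>')
# start: p []
# data:  Hello
# start: b []
# data:  world
# end:   b
# end:   p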

Though a reliable tool, it has limitations when dealing with unstructured HTML and does not support JavaScript-rendered content. Its smaller community and infrequent updates can also limit its usefulness for complex scraping projects.

Pros of HTMLParser

  • No external dependencies (built into Python's standard library).
  • Lightweight and fast for basic tasks.
  • Good control over the parsing process through an event-driven method.

Cons of HTMLParser

  • Limited flexibility compared to libraries like BeautifulSoup or lxml.
  • Doesn't handle broken or malformed HTML well.
  • Lacks JavaScript-rendered content support.
  • Smaller community and less frequent updates compared to third-party tools.

How to Parse HTML with HTMLParser

To extract the product details from the target website, you first need to install the Requests library to fetch the webpage content:

Terminal
pip3 install requests

Then, import the standard HTMLParser class and define methods to handle different parts of the document as it's being parsed:

Example
import requests
from html.parser import HTMLParser

# subclass HTMLParser to customize how we handle tags
class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.is_parsing_title = False
        self.is_parsing_price = False
        self.product_titles = []
        self.product_prices = []
        self.product_images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # detect product titles by checking for the class 'woocommerce-loop-product__title'
        if tag == 'h2' and 'class' in attrs and 'woocommerce-loop-product__title' in attrs['class']:
            self.is_parsing_title = True
        # detect prices by checking for the class 'woocommerce-Price-amount'
        if tag == 'span' and 'class' in attrs and 'woocommerce-Price-amount' in attrs['class']:
            self.is_parsing_price = True
        # detect images by checking for the class 'attachment-woocommerce_thumbnail'
        if tag == 'img' and 'class' in attrs and 'attachment-woocommerce_thumbnail' in attrs['class']:
            self.product_images.append(attrs['src'])

    def handle_endtag(self, tag):
        # reset flags at the end of the tag
        if tag == 'h2':
            self.is_parsing_title = False
        if tag == 'span':
            self.is_parsing_price = False

    def handle_data(self, data):
        # extract and save data when inside the correct tags
        if self.is_parsing_title:
            self.product_titles.append(data.strip())
        if self.is_parsing_price:
            self.product_prices.append(data.strip())

# fetch the page content
url = 'https://www.scrapingcourse.com/ecommerce/'
response = requests.get(url)

# create an instance of the parser and feed it the page content
parser = ProductParser()
parser.feed(response.text)

# loop through the products and print the details
for name, price, img in zip(parser.product_titles, parser.product_prices, parser.product_images):
    print(f"Product: {name}\nPrice: {price}\nImage URL: {img}")
    print("-----------------------------------")

The code will give the same output as previous sections. We're almost done—one more to go!

6. PyQuery

PyQuery is a Python library that lets you run jQuery-style queries on XML and HTML documents. Built on top of lxml, its jQuery-like syntax makes it intuitive for developers already familiar with jQuery.

PyQuery offers flexible data extraction using CSS selectors, similar to how jQuery works in a browser environment. This makes it simple to traverse and manipulate the DOM. 

Unlike other Python parsers like BeautifulSoup, PyQuery has more advanced DOM manipulation capabilities that allow you to edit HTML documents directly. 
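
For instance, here's a minimal sketch (on a made-up snippet) of those jQuery-style in-place DOM edits:

Example
from pyquery import PyQuery as pq

# a made-up document for illustration
d = pq('<div><p class="msg">old text</p></div>')

# jQuery-style chaining: set the text, then add a class, all in place
d('p.msg').text('new text').addClass('updated')

print(d)  # <div><p class="msg updated">new text</p></div>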

While actively maintained, its community and documentation are smaller compared to more widely used parsers like BeautifulSoup and lxml.

Pros of PyQuery

  • jQuery-like syntax, making it familiar and easy for web developers.
  • Works well for CSS selector-based HTML traversal and manipulation.
  • Supports both HTML and XML parsing.
  • Integrates smoothly with other libraries like Requests.

Cons of PyQuery

  • Slower than lxml when dealing with large datasets.
  • Less flexible than BeautifulSoup when handling messy or broken HTML.
  • Smaller community and fewer resources compared to other libraries.
  • No support for JavaScript-rendered content.

How to Parse HTML with PyQuery

To get started, you'll need to install both PyQuery and Requests using the following command:

Terminal
pip3 install pyquery requests

Fetch the page content and use the jQuery-style CSS selectors to extract the product details:

Example
# pip3 install pyquery requests
import requests
from pyquery import PyQuery as pq

# fetch the page content
url = 'https://www.scrapingcourse.com/ecommerce/'
response = requests.get(url)

# parse the content with pyquery
doc = pq(response.text)

# extract product names, prices, and image URLs using CSS selectors
product_names = doc('h2.woocommerce-loop-product__title').items()
product_prices = doc('span.woocommerce-Price-amount').items()
product_images = doc('img.attachment-woocommerce_thumbnail').items()

# loop through the products and print the details
for name, price, img in zip(product_names, product_prices, product_images):
    print(f"Product: {name.text().strip()}")
    print(f"Price: {price.text().strip()}")
    print(f"Image URL: {img.attr('src')}")
    print("-" * 35)

The code will give the same output as previous sections. This wraps up our list of parsers. 

Conclusion 

Python offers a variety of tools for HTML parsing, each with its unique strengths. BeautifulSoup and lxml excel in handling both structured and unstructured data, while html5lib ensures strict compliance with modern HTML standards. The built-in HTML parser provides a lightweight option for quick parsing tasks, and PyQuery offers a jQuery-like API for easy manipulation and querying of documents.

For most websites today, especially those with JavaScript-heavy content or frequent layout changes, ZenRows stands out as the better option for parsing HTML. Its ability to handle dynamic content, coupled with features like rotating residential proxies and CAPTCHA auto-bypass, makes it a versatile and reliable solution for modern web scraping challenges.

Try ZenRows for free—no credit card needed.
