Python Lxml for Web Scraping: Step-by-Step 2024 Tutorial

May 15, 2024 · 8 min read

Do you want to use the lxml library for your next Python web scraping project? We’ve got you covered.

In this article, you'll learn how to extract content from a website using lxml and Python's Requests. You'll also see how to combine both libraries to scrape a paginated website.

Let’s go!

What Is Lxml?

Lxml is a Python library for handling HTML and XML documents. It can create XML documents and parses existing HTML and XML into an element tree, allowing you to access and manipulate web elements while web scraping with Python.
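Here's a minimal taste of the library before diving in: parsing an inline HTML snippet and reading an element's text with an XPath query.

example.py
# parse an HTML string and read an element's text with XPath
from lxml import html

tree = html.fromstring("<html><body><h1>Hello, lxml!</h1></body></html>")
print(tree.xpath("//h1/text()")[0])  # Hello, lxml!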

How to Parse HTML Using Lxml?

In this section, you'll go through six steps of extracting content from scrapingcourse.com, a demo e-commerce website.

You'll start with the full-page HTML and extract the title to learn single-element extraction. Then, you'll scale your scraper to extract multiple pieces of content and write them to a CSV.

Take a quick look at the website's layout before moving to the tutorial:

Demo Page

Step 1: Install Lxml and Cssselect

Before starting, you have to install two libraries:

  • The lxml library itself, since it's not a standard Python package.
  • **cssselect**, a third-party CSS selector library. While lxml supports XPath queries out of the box, CSS selectors are often shorter and easier to read, so cssselect will make element location easier (see the quick comparison after the install command below).

Install both libraries using pip:

Terminal
pip install lxml cssselect
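
To see why cssselect earns its place, here's the same hypothetical element located both ways: with XPath (built into lxml) and with a CSS selector (via cssselect).

example.py
# the same element located two ways: XPath vs. a CSS selector
from lxml import html

tree = html.fromstring('<ul><li class="product"><h2>Abominable Hoodie</h2></li></ul>')
print(tree.xpath('//li[@class="product"]/h2/text()')[0])  # XPath
print(tree.cssselect("li.product h2")[0].text_content())  # CSS selector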

Step 2: Get the HTML Before Parsing

Now that you've installed the libraries, it's time to get the full-page HTML before parsing it with lxml. This step ensures that your HTTP client obtains the page content as expected.

Although you can use other Python HTTP clients, this tutorial uses Python's Requests library. Install it using pip:

Terminal
pip install requests

Now, let's get the page's HTML. Request the target website and print its content:

scraper.py
# import the required libraries
import requests

# open the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

# validate the response status
if response.status_code == 200:

    # print the content if successful
    print(response.text)
else:
    print(f"{response.status_code}, unable to process request")

The code outputs the website's HTML content:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
  
    <title>An ecommerce store to scrape – Scraping Laboratory</title>
    
  <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->
      
        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>
      
        <!-- ... other products omitted for brevity -->
    </ul>
</body>
</html>
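
As an aside, Requests can also raise an exception for 4xx and 5xx responses instead of branching on the status code manually. This variant uses the library's raise_for_status method:

example.py
# variant: let Requests raise on error statuses instead of branching manually
import requests

response = requests.get("https://scrapingcourse.com/ecommerce/")
response.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
print(response.text)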

Step 3: Extract a Single Element

To extract a single element, you need to point lxml to a specific element's attribute and get its content. Let's scrape the target page's title to see how it works.

First, import the required libraries and open the target web page with the Requests library:

scraper.py
# import the required libraries
from lxml import html
import requests

# open the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

Validate the request and parse the HTML content with lxml. Once it's parsed, obtain the title tag using XPath or CSS selectors.

In this case, go with lxml's findtext method, which takes an XPath-like path expression and returns the text of the first matching element:

scraper.py
# ...

# validate the response status
if response.status_code == 200:

    # parse the HTML content
    tree = html.fromstring(response.content)

    # obtain the title element using XPath
    title = tree.findtext(".//title")
    
    # print the result
    print(title)
    
else:
    print(f"{response.status_code}, unable to process request")

Merge both snippets, and you should get the following complete code:

scraper.py
# import the required libraries
from lxml import html
import requests

# open the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

# validate the response status
if response.status_code == 200:

    # parse the HTML content
    tree = html.fromstring(response.content)

    # obtain the title element using XPath
    title = tree.findtext(".//title")
    
    # print the result
    print(title)
    
else:
    print(f"{response.status_code}, unable to process request")

The code outputs the page title, as expected:

Output
An ecommerce store to scrape – Scraping Laboratory
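
For reference, findtext isn't the only way to get there. Assuming the parsed tree from the code above, the same title comes back from a full XPath query or a CSS selector:

example.py
# alternative lookups for the same title element
title_by_xpath = tree.xpath("//title/text()")[0]
title_by_css = tree.cssselect("title")[0].text_content()
print(title_by_xpath, title_by_css)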

Good job! Now, let's extract more elements.

Step 4: Extract Multiple Elements

Extracting multiple elements requires parsing their attributes with the lxml library. In this tutorial, you'll extract the first product's name, price, and image source.

Before you begin, right-click the first product on the page and click “Inspect” to view its attributes:

DevTools Inspection

Let's extract the first product. Import the required libraries, open the target website with the Requests library, and validate its response. Then, parse the HTML with lxml:

scraper.py
# import the required library
from lxml import html
import requests

# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

# validate the response status
if response.status_code == 200:

    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    # stop the script early if the request failed
    print(f"{response.status_code}, unable to process request")
    exit()

Extract the target data from the product element with the CSS selector. The cssselect method returns an element list, so ensure you index it to return the first product on the list, as shown below:

scraper.py
# ...

# scrape the target content
product = {
    "name": tree.cssselect("h2.woocommerce-loop-product__title")[0].text_content(),
    "price": tree.cssselect("span.price")[0].text_content(),
    "image_source": tree.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src"),
}

# output the content
print(product)

Combine the two snippets, and your complete code should look like this:

scraper.py
# import the required library
from lxml import html
import requests

# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

# validate the response status
if response.status_code == 200:

    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    # stop the script early if the request failed
    print(f"{response.status_code}, unable to process request")
    exit()

# scrape the target content
product = {
    "name": tree.cssselect("h2.woocommerce-loop-product__title")[0].text_content(),
    "price": tree.cssselect("span.price")[0].text_content(),
    "image_source": tree.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src"),
}

# output the content
print(product)

The code above extracts the first product's name, price, and image URL:

Output
{
  'name': 'Abominable Hoodie', 
  'price': '$69.00', 
  'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
}
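
One caveat before scaling up: cssselect returns a plain list, so indexing with [0] raises an IndexError when a selector matches nothing. If you'd rather degrade gracefully, a small helper like the hypothetical first_or_none below (not part of lxml, just a sketch assuming the tree from above) absorbs missing elements:

example.py
# a defensive sketch: return None instead of raising when nothing matches
def first_or_none(tree, selector):
    matches = tree.cssselect(selector)
    return matches[0] if matches else None

name_element = first_or_none(tree, "h2.woocommerce-loop-product__title")
print(name_element.text_content() if name_element is not None else "name not found")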

Next, let's tweak the code to get all the products on the first page.

Step 5: Extract All Matching Elements From a Page

Extracting all matching elements involves looping through all product containers to obtain the target elements.

Let's inspect the web page to view the container elements. Each product sits inside a list item (<li>) tag with the product class:

Container DevTools Inspection

To extract the names, prices, and image URLs from all the products on the first page, you'll build on the previous code.

Request the target web page, verify the request status, and parse the returned HTML with lxml:

scraper.py
# import the required library
from lxml import html
import requests

# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

# validate the response status
if response.status_code == 200:

    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    # stop the script early if the request failed
    print(f"{response.status_code}, unable to process request")
    exit()

Obtain the product containers using a CSS selector, and declare an empty list to collect the extracted data:

scraper.py
# ...

# obtain the product container
containers = tree.cssselect("li.product")

# declare an empty list to collate extracted data
data = []

Iterate through the containers with a for loop to extract the desired content into a dictionary using CSS selectors. Then, append the extracted data to the list and print it.

scraper.py
# ...

# loop through the product container
for container in containers:

    # declare an empty dictionary
    item_data = {}

    # scrape the target content from the current container
    item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
    item_data["price"] = container.cssselect("span.price")[0].text_content()
    item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
    
    # append the extracted data to the list
    data.append(item_data)

# output the extracted content
print(data)

Combine your snippets to get the complete code below:

scraper.py
# import the required library
from lxml import html
import requests

# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

# validate the response status
if response.status_code == 200:

    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    # stop the script early if the request failed
    print(f"{response.status_code}, unable to process request")
    exit()

# obtain the product container
containers = tree.cssselect("li.product")

# declare an empty list to collate extracted data
data = []

# loop through the product container
for container in containers:

    # declare an empty dictionary
    item_data = {}

    # scrape the target content from the current container
    item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
    item_data["price"] = container.cssselect("span.price")[0].text_content()
    item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
    
    # append the extracted data to the list
    data.append(item_data)

# output the extracted content
print(data)

The code outputs all the products on the first page, as shown:

Output
[
  {
    'name': 'Abominable Hoodie', 
    'price': '$69.00', 
    'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
  },
  
  # ... other products omitted for brevity

  {
    'name': 'Artemis Running Short', 
    'price': '$45.00', 
    'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg'
  }
]
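
As a stylistic note, the extraction loop above condenses to a list comprehension with identical behavior (assuming the containers list from the previous snippet):

example.py
# the same extraction expressed as a list comprehension
data = [
    {
        "name": c.cssselect("h2.woocommerce-loop-product__title")[0].text_content(),
        "price": c.cssselect("span.price")[0].text_content(),
        "image_source": c.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src"),
    }
    for c in containers
]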

You just scraped a full page with lxml and the Requests library. Let's complete the task by writing the data to a CSV file.

Step 6: Export to CSV

Writing the extracted data to CSV helps you organize it for further processing. Let's change the previous code to export the extracted content to a CSV file.

First, add Python's csv package to your imported libraries. Then, specify the column names and write the data into rows inside the CSV file. Save it to your project directory.

Modify your previous code with the following snippet to achieve that:

scraper.py
# import the required library
from lxml import html
import requests
import csv

# ...

# define the fieldnames for the CSV file
field_names = ["name", "price", "image_source"]

# write the data to a CSV file
with open("products.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    
    # write the header row
    writer.writeheader()
    
    # write the data rows
    for row_data in data:
        writer.writerow(row_data)

print("Data has been written to products.csv")

The complete code should look like this:

scraper.py
# import the required library
from lxml import html
import requests
import csv

# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")

# validate the response status
if response.status_code == 200:

    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    # stop the script early if the request failed
    print(f"{response.status_code}, unable to process request")
    exit()

# obtain the product container
containers = tree.cssselect("li.product")

# declare an empty list to collate extracted data
data = []

# loop through the product container
for container in containers:

    # declare an empty dictionary
    item_data = {}

    # scrape the target content from the current container
    item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
    item_data["price"] = container.cssselect("span.price")[0].text_content()
    item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
    
    # append the extracted data to the list
    data.append(item_data)

# define the fieldnames for the CSV file
field_names = ["name", "price", "image_source"]

# write the data to a CSV file
with open("products.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    
    # write the header row
    writer.writeheader()
    
    # write the data rows
    for row_data in data:
        writer.writerow(row_data)

print("Data has been written to products.csv")

The code above exports the extracted data into a CSV file. See the result below:

CSV file
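
If pandas already happens to be in your toolchain, the same export is a two-liner. A sketch assuming the data list built above and pandas installed (pip install pandas):

example.py
# optional: export the scraped data with pandas instead of the csv module
import pandas as pd

pd.DataFrame(data).to_csv("products.csv", index=False)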

Your script now exports the extracted content to a CSV file. Congratulations! However, your scraper is still limited to the first page. You'll need to build a Python web crawler to scrape more pages.

Scrape Pagination With Lxml and Requests

Pagination scraping lets you navigate each page on a website and extract data from it. It involves following the “Next page” button on the navigation bar using its href attribute.

The current target website breaks content into pages using a navigation bar. Let's inspect the site to see the next-page button element:

Next-Page Button DevTools Inspection

To follow the hrefs and scrape more pages from the target website, you'll implement pagination with the Requests library.

Define a scraper function that accepts a URL argument and sets the initial page count to zero. Open the target URL and validate its response. Then, parse the returned HTML with lxml:

scraper.py
# import the required libraries
from lxml import html
import requests

def scraper(url, page_count=0):
    # send a request to the target website
    response = requests.get(url)

    # validate the response status
    if response.status_code == 200:
        
        # parse the HTML content
        tree = html.fromstring(response.content)
    else:
        # return an empty list so a failed page doesn't break data.extend()
        print(f"{response.status_code}, unable to process request")
        return []

Retrieve the product containers and iterate through each to extract its content into a dictionary. Append each dictionary to a list. To cap the crawl at ten pages, return the collected data once the page count reaches nine; otherwise, increment the page count.

scraper.py
def scraper(url, page_count=0):
    
    # ...
    
    # obtain the product container
    containers = tree.cssselect("li.product")

    # declare an empty list to collate extracted data
    data = []

    # loop through the product container
    for container in containers:
        # scrape the target content within the current container
        item_data = {}
        item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
        item_data["price"] = container.cssselect("span.price")[0].text_content()
        item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
        
        data.append(item_data)

    # stop recursing once ten pages have been scraped
    if page_count >= 9:
        return data

    # increment page_count
    page_count += 1

Next, obtain the next-page button's href attribute and call the scraper function recursively. Lastly, execute the function and print the extracted data.

scraper.py
def scraper(url, page_count=0):
  
    # ...
   
    # get the next page link
    next_link_element = tree.cssselect("a.next.page-numbers")

    # validate the next page element and extract the next page link
    if next_link_element:

        # extract the next page URL
        next_link = next_link_element[0].get("href")
        print(f"Scraping from: {next_link}")

        # recurse the function on the next page link if it exists
        data.extend(scraper(next_link, page_count))

    return data

# run the scraper function and print the extracted data
result_data = scraper("https://scrapingcourse.com/ecommerce/")
print(result_data)

Here's the complete code:

scraper.py
# import the required libraries
from lxml import html
import requests

def scraper(url, page_count=0):
    # send a request to the target website
    response = requests.get(url)

    # validate the response status
    if response.status_code == 200:
        
        # parse the HTML content
        tree = html.fromstring(response.content)
    else:
        # return an empty list so a failed page doesn't break data.extend()
        print(f"{response.status_code}, unable to process request")
        return []
    
    # obtain the product container
    containers = tree.cssselect("li.product")

    # declare an empty list to collate extracted data
    data = []

    # loop through the product container
    for container in containers:
        # scrape the target content within the current container
        item_data = {}
        item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
        item_data["price"] = container.cssselect("span.price")[0].text_content()
        item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
        
        data.append(item_data)

    # stop recursing once ten pages have been scraped
    if page_count >= 9:
        return data

    # increment page_count
    page_count += 1

    # get the next page link
    next_link_element = tree.cssselect("a.next.page-numbers")

    # validate the next page element and extract the next page link
    if next_link_element:

        # extract the next page URL
        next_link = next_link_element[0].get("href")
        print(f"Scraping from: {next_link}")

        # recurse the function on the next page link if it exists
        data.extend(scraper(next_link, page_count))

    return data

# run the scraper function and print the extracted data
result_data = scraper("https://scrapingcourse.com/ecommerce/")
print(result_data)

The code scrapes products' names, prices, and image URLs from the first ten pages:

Output
[
  {
    'name': 'Abominable Hoodie', 
    'price': '$69.00', 
    'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
  },
  
  # ... other products omitted for brevity
  
  {
    'name': 'Sprite Stasis Ball 75 cm', 
    'price': '$32.00',
    'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/luma-stability-ball-324x324.jpg'
  }
]
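
Recursion is fine for ten pages, but each followed link adds a stack frame, and Python's default recursion limit sits around 1,000. For deeper crawls, an iterative loop is the safer pattern. Here's a minimal sketch under the same assumptions as the recursive version:

example.py
# iterative variant: follow "next" links in a loop instead of recursing
from lxml import html
import requests

def scrape_all(url, max_pages=10):
    data = []
    for _ in range(max_pages):
        response = requests.get(url)
        if response.status_code != 200:
            break

        # parse the current page and extract every product on it
        tree = html.fromstring(response.content)
        for container in tree.cssselect("li.product"):
            data.append({
                "name": container.cssselect("h2.woocommerce-loop-product__title")[0].text_content(),
                "price": container.cssselect("span.price")[0].text_content(),
                "image_source": container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src"),
            })

        # stop when there is no next-page link
        next_link_element = tree.cssselect("a.next.page-numbers")
        if not next_link_element:
            break
        url = next_link_element[0].get("href")

    return data

print(scrape_all("https://scrapingcourse.com/ecommerce/"))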

Great job! Your crawler can now scrape a paginated website with lxml.

Conclusion

In this tutorial, you've learned how to extract content from a website using the lxml parser and Python's Requests library. Here's what the process looks like:

  • Obtaining web page content before parsing its HTML with lxml.
  • Scraping a single element from a web page.
  • Extracting multiple items from a specific web element.
  • Scaling content extraction to scrape all matching elements across a web page.
  • Exporting the scraped data to a CSV file.
  • Navigating subsequent pages and extracting content from them.

However, no matter how sophisticated your script is, websites’ protection mechanisms can still block it and prevent you from scraping at scale.

To bypass all anti-bot detection, we recommend integrating ZenRows, an all-in-one web scraping solution, into your web scraper. Try ZenRows for free!
