Do you want to use the lxml library for your next Python web scraping project? We’ve got you covered.
In this article, you'll learn how to extract content from a website using lxml and Python's Requests. You'll also see how to combine both libraries to scrape a paginated website.
Let’s go!
What Is Lxml?
Lxml is a Python library for processing HTML and XML documents. It can create XML content and parses HTML and XML into a tree structure, allowing you to access and manipulate web elements while web scraping with Python.
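As a quick taste of what that looks like in practice (the HTML string here is just an illustration):

# a tiny parse tree in action: parse an HTML string, then read an element's text
from lxml import html

tree = html.fromstring("<html><body><h1>Hello, lxml!</h1></body></html>")
print(tree.findtext(".//h1"))  # Hello, lxml!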
How to Parse HTML Using Lxml?
In this section, you'll go through six steps of extracting content from scrapingcourse.com, a demo e-commerce website.
You'll start with the full-page HTML and extract the title to learn single-element extraction. Then, you'll scale your scraper to extract multiple pieces of content and write them to a CSV.
Take a quick look at the website's layout before moving to the tutorial:
Step 1: Install Lxml and Cssselect
Before starting, you have to install two libraries:
- The lxml library itself, since it's not a standard Python package.
- cssselect, a third-party CSS selector library. While lxml ships with built-in XPath support, cssselect makes locating elements easier with familiar CSS selectors.
Install both libraries using pip:
pip install lxml cssselect
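To see the difference in practice, here's a minimal sketch (with an illustrative HTML string) locating the same element via lxml's built-in XPath and via a CSS selector:

# locate the same element with XPath and with a CSS selector
from lxml import html

doc = html.fromstring('<ul><li class="product">Hoodie</li></ul>')

# lxml's built-in XPath
print(doc.xpath('//li[@class="product"]')[0].text_content())  # Hoodie

# cssselect-powered CSS selector
print(doc.cssselect("li.product")[0].text_content())  # Hoodie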
Step 2: Get the HTML Before Parsing
Now that you've installed the libraries, it's time to get the full-page HTML before parsing it with lxml. This step ensures that your HTTP client obtains the page content as expected.
Although you can use other Python HTTP clients, this tutorial uses Python's Requests library. Install it using pip:
pip install requests
Now, let's get the page's HTML. Request the target website and print its content:
# import the required libraries
import requests
# open the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
# validate the response status
if response.status_code == 200:
    # print the content if successful
    print(response.text)
else:
    print(f"{response.status_code}, unable to process request")
The code outputs the website's HTML content:
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
    <title>An ecommerce store to scrape – Scraping Laboratory</title>
    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->
        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>
        <!-- ... other products omitted for brevity -->
    </ul>
</body>
</html>
We recommend complementing your Python scraper with ZenRows to avoid getting blocked while extracting content from real websites.
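Even without a scraping API, you can make the plain request more resilient by sending a browser-like User-Agent header and setting a timeout. Here's a minimal sketch (the header string is illustrative):

# send a browser-like User-Agent and avoid hanging forever on a slow server
import requests

headers = {
    # an illustrative desktop browser User-Agent string
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

response = requests.get(
    "https://scrapingcourse.com/ecommerce/", headers=headers, timeout=10
)
print(response.status_code)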
Step 3: Extract a Single Element
To extract a single element, you need to point lxml to a specific element's attribute and get its content. Let's scrape the target page's title to see how it works.
First, import the required libraries and open the target web page with the Requests library:
# import the required libraries
from lxml import html
import requests
# open the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
Validate the request and parse the HTML content with lxml. Once it's parsed, obtain the title tag using XPath or CSS selectors.
In this case, go with lxml's findtext method, which uses XPath to locate the first matching element under the hood:
# ...
# validate the response status
if response.status_code == 200:
    # parse the HTML content
    tree = html.fromstring(response.content)
    # obtain the title element using XPath
    title = tree.findtext(".//title")
    # print the result
    print(title)
else:
    print(f"{response.status_code}, unable to process request")
Merge both snippets, and you should get the following complete code:
# import the required libraries
from lxml import html
import requests
# open the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
# validate the response status
if response.status_code == 200:
    # parse the HTML content
    tree = html.fromstring(response.content)
    # obtain the title element using XPath
    title = tree.findtext(".//title")
    # print the result
    print(title)
else:
    print(f"{response.status_code}, unable to process request")
The code outputs the page title, as expected:
An ecommerce store to scrape – Scraping Laboratory
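As an aside, findtext returns only the first match's text. lxml's more general xpath method returns a list of all matches, which is useful when a path can match several elements:

# ...
# xpath returns a list of results instead of a single value
titles = tree.xpath("//title/text()")
print(titles)  # ['An ecommerce store to scrape – Scraping Laboratory']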
Good job! Now, let's extract more elements.
Step 4: Extract Multiple Elements
Extracting multiple elements involves locating each one by its attributes with the lxml library. In this tutorial, you'll extract the first product's name, price, and image source.
Before you begin, right-click the first product on the page and click “Inspect” to view its attributes:
Let's extract the first product. Import the required libraries, open the target website with the Requests library, and validate its response. Then, parse the HTML with lxml:
# import the required libraries
from lxml import html
import requests
# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
# validate the response status
if response.status_code == 200:
    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    print(f"{response.status_code}, unable to process request")
    # stop here so the code below never runs without a parsed tree
    exit()
Extract the target data from the product element with CSS selectors. The cssselect method returns a list of elements, so index it to get the first product on the list, as shown below:
# ...
# scrape the target content
product = {
    "name": tree.cssselect("h2.woocommerce-loop-product__title")[0].text_content(),
    "price": tree.cssselect("span.price")[0].text_content(),
    "image_source": tree.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src"),
}
# output the content
print(product)
Combine the two snippets, and your complete code should look like this:
# import the required libraries
from lxml import html
import requests
# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
# validate the response status
if response.status_code == 200:
    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    print(f"{response.status_code}, unable to process request")
    # stop here so the code below never runs without a parsed tree
    exit()
# scrape the target content
product = {
    "name": tree.cssselect("h2.woocommerce-loop-product__title")[0].text_content(),
    "price": tree.cssselect("span.price")[0].text_content(),
    "image_source": tree.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src"),
}
# output the content
print(product)
The code above extracts the first product's name, price, and image URL:
{
    'name': 'Abominable Hoodie',
    'price': '$69.00',
    'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
}
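One caveat: cssselect returns an empty list when nothing matches, so indexing [0] raises an IndexError if the page layout changes. A small defensive helper (the function name is our own) avoids that:

# ...
def first_text(tree, selector, default=""):
    # return the first match's text, or a default when nothing matches
    matches = tree.cssselect(selector)
    return matches[0].text_content() if matches else default

print(first_text(tree, "h2.woocommerce-loop-product__title"))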
Next, let's tweak the code to get all the products on the first page.
Step 5: Extract All Matching Elements From a Page
Extracting all matching elements involves looping through all product containers to obtain the target elements.
Let's inspect the web page to view the container elements. Each product sits inside a list item (li) tag:
To extract the names, prices, and image URLs from all the products on the first page, you'll build on the previous code.
Request the target web page, verify the request status, and parse the returned HTML with lxml:
# import the required libraries
from lxml import html
import requests
# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
# validate the response status
if response.status_code == 200:
    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    print(f"{response.status_code}, unable to process request")
    # stop here so the code below never runs without a parsed tree
    exit()
Obtain the product containers using a CSS selector, and declare an empty list to store the extracted data:
# ...
# obtain the product container
containers = tree.cssselect("li.product")
# declare an empty list to collate extracted data
data = []
Iterate through the containers with a for loop to extract the desired content into a dictionary using CSS selectors. Then, append the extracted data to the list and print it.
# ...
# loop through the product containers
for container in containers:
    # declare an empty dictionary
    item_data = {}
    # scrape the target content from the current container
    item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
    item_data["price"] = container.cssselect("span.price")[0].text_content()
    item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
    # append the extracted data to the list
    data.append(item_data)
# output the extracted content
print(data)
Combine your snippets to get the complete code below:
# import the required libraries
from lxml import html
import requests
# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
# validate the response status
if response.status_code == 200:
    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    print(f"{response.status_code}, unable to process request")
    # stop here so the code below never runs without a parsed tree
    exit()
# obtain the product containers
containers = tree.cssselect("li.product")
# declare an empty list to collate extracted data
data = []
# loop through the product containers
for container in containers:
    # declare an empty dictionary
    item_data = {}
    # scrape the target content from the current container
    item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
    item_data["price"] = container.cssselect("span.price")[0].text_content()
    item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
    # append the extracted data to the list
    data.append(item_data)
# output the extracted content
print(data)
The code outputs all the products on the first page, as shown:
[
    {
        'name': 'Abominable Hoodie',
        'price': '$69.00',
        'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
    },
    # ... other products omitted for brevity
    {
        'name': 'Artemis Running Short',
        'price': '$45.00',
        'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg'
    }
]
You just scraped a full page with lxml and the Requests library. Let's complete the task by writing the data to a CSV file.
Step 6: Export to CSV
Writing the extracted data to CSV helps you organize it for further processing. Let's change the previous code to export the extracted content to a CSV file.
First, add Python's csv package to your imported libraries. Then, specify the column names and write the data into rows inside the CSV file. Save it to your project directory.
Modify your previous code with the following snippet to achieve that:
# import the required libraries
from lxml import html
import requests
import csv
# ...
# define the fieldnames for the CSV file
field_names = ["name", "price", "image_source"]
# write the data to a CSV file
with open("products.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    # write the header row
    writer.writeheader()
    # write the data rows
    for row_data in data:
        writer.writerow(row_data)
print("Data has been written to products.csv")
The complete code should look like this:
# import the required libraries
from lxml import html
import requests
import csv
# send a request to the target website
response = requests.get("https://scrapingcourse.com/ecommerce/")
# validate the response status
if response.status_code == 200:
    # parse the HTML content
    tree = html.fromstring(response.content)
else:
    print(f"{response.status_code}, unable to process request")
    # stop here so the code below never runs without a parsed tree
    exit()
# obtain the product containers
containers = tree.cssselect("li.product")
# declare an empty list to collate extracted data
data = []
# loop through the product containers
for container in containers:
    # declare an empty dictionary
    item_data = {}
    # scrape the target content from the current container
    item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
    item_data["price"] = container.cssselect("span.price")[0].text_content()
    item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
    # append the extracted data to the list
    data.append(item_data)
# define the fieldnames for the CSV file
field_names = ["name", "price", "image_source"]
# write the data to a CSV file
with open("products.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    # write the header row
    writer.writeheader()
    # write the data rows
    for row_data in data:
        writer.writerow(row_data)
print("Data has been written to products.csv")
The code above exports the extracted data into a CSV file. See the result below:
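To sanity-check the export, you can read the file back with Python's csv.DictReader:

# read the CSV back to confirm the rows were written correctly
import csv

with open("products.csv", newline="", encoding="utf-8") as csvfile:
    for row in csv.DictReader(csvfile):
        print(row["name"], row["price"])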
Your script now exports the extracted content to a CSV file. Congratulations! However, your scraper is still limited to the first page. You'll need to build a Python web crawler to scrape more pages.
Scrape Pagination With Lxml and Requests
Pagination scraping lets you navigate each page on a website and extract data from it. It involves following the “Next page” button on the navigation bar using its href attribute.
The current target website breaks content into pages using a navigation bar. Let's inspect the site to see the next-page button element:
To follow those href links and scrape more pages from the target website, you'll implement pagination with Python's Requests library.
Define a scraper function that accepts a URL argument and sets the initial page count to zero. Open the target URL and validate its response. Then, parse the returned HTML with lxml:
# import the required libraries
from lxml import html
import requests

def scraper(url, page_count=0):
    # send a request to the target website
    response = requests.get(url)
    # validate the response status
    if response.status_code == 200:
        # parse the HTML content
        tree = html.fromstring(response.content)
    else:
        # return an empty list so the caller can safely extend it
        print(f"{response.status_code}, unable to process request")
        return []
Retrieve the product containers and iterate through each one to extract its content into a dictionary, appending the data to an empty list. To cap the crawl at ten pages, return the scraped data once the page count reaches nine; otherwise, increment the page count.
def scraper(url, page_count=0):
    # ...
    # obtain the product containers
    containers = tree.cssselect("li.product")
    # declare an empty list to collate extracted data
    data = []
    # loop through the product containers
    for container in containers:
        # scrape the target content within the current container
        item_data = {}
        item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
        item_data["price"] = container.cssselect("span.price")[0].text_content()
        item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
        data.append(item_data)
    # if you've reached page 10, return the data without recursing further
    if page_count >= 9:
        return data
    # increment page_count
    page_count += 1
Next, obtain the next page button's href attribute and call the scraper function recursively on that URL. Lastly, execute the function and print the extracted data.
def scraper(url, page_count=0):
    # ...
    # get the next page link
    next_link_element = tree.cssselect("a.next.page-numbers")
    # validate the next page element and extract the next page link
    if next_link_element:
        # extract the next page URL
        next_link = next_link_element[0].get("href")
        print(f"Scraping from: {next_link}")
        # recurse the function on the next page link if it exists
        data.extend(scraper(next_link, page_count))
    return data

# run the scraper function and print the extracted data
result_data = scraper("https://scrapingcourse.com/ecommerce/")
print(result_data)
Here's the complete code:
# import the required libraries
from lxml import html
import requests

def scraper(url, page_count=0):
    # send a request to the target website
    response = requests.get(url)
    # validate the response status
    if response.status_code == 200:
        # parse the HTML content
        tree = html.fromstring(response.content)
    else:
        # return an empty list so the caller can safely extend it
        print(f"{response.status_code}, unable to process request")
        return []
    # obtain the product containers
    containers = tree.cssselect("li.product")
    # declare an empty list to collate extracted data
    data = []
    # loop through the product containers
    for container in containers:
        # scrape the target content within the current container
        item_data = {}
        item_data["name"] = container.cssselect("h2.woocommerce-loop-product__title")[0].text_content()
        item_data["price"] = container.cssselect("span.price")[0].text_content()
        item_data["image_source"] = container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src")
        data.append(item_data)
    # if you've reached page 10, return the data without recursing further
    if page_count >= 9:
        return data
    # increment page_count
    page_count += 1
    # get the next page link
    next_link_element = tree.cssselect("a.next.page-numbers")
    # validate the next page element and extract the next page link
    if next_link_element:
        # extract the next page URL
        next_link = next_link_element[0].get("href")
        print(f"Scraping from: {next_link}")
        # recurse the function on the next page link if it exists
        data.extend(scraper(next_link, page_count))
    return data

# run the scraper function and print the extracted data
result_data = scraper("https://scrapingcourse.com/ecommerce/")
print(result_data)
The code scrapes products' names, prices, and image URLs from the first ten pages:
[
    {
        'name': 'Abominable Hoodie',
        'price': '$69.00',
        'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg'
    },
    # ... other products omitted for brevity
    {
        'name': 'Sprite Stasis Ball 75 cm',
        'price': '$32.00',
        'image_source': 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/luma-stability-ball-324x324.jpg'
    }
]
Great job! Your crawler can now scrape a paginated website with lxml.
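As a side note, recursion is fine for ten pages, but very deep crawls can eventually hit Python's recursion limit. The same next-link logic also works iteratively; here's a sketch under the same assumptions as the code above (the function name is our own):

# iterative take on the same pagination logic, with no recursion depth to manage
from lxml import html
import requests

def scrape_all(url, max_pages=10):
    data = []
    for _ in range(max_pages):
        response = requests.get(url)
        if response.status_code != 200:
            break
        tree = html.fromstring(response.content)
        # extract every product on the current page
        for container in tree.cssselect("li.product"):
            data.append({
                "name": container.cssselect("h2.woocommerce-loop-product__title")[0].text_content(),
                "price": container.cssselect("span.price")[0].text_content(),
                "image_source": container.cssselect("img.attachment-woocommerce_thumbnail")[0].get("src"),
            })
        # follow the next-page link, or stop when there isn't one
        next_link_element = tree.cssselect("a.next.page-numbers")
        if not next_link_element:
            break
        url = next_link_element[0].get("href")
    return data

print(len(scrape_all("https://scrapingcourse.com/ecommerce/")))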
Conclusion
In this tutorial, you've learned how to extract content from a website using the lxml parser and Python's Requests library. Here's what the process looks like:
- Obtaining web page content before parsing its HTML with lxml.
- Scraping a single element from a web page.
- Extracting multiple items from a specific web element.
- Scaling content extraction to scrape all matching elements across a web page.
- Exporting the scraped data to a CSV file.
- Navigating subsequent pages and extracting content from them.
However, no matter how sophisticated your script is, websites’ protection mechanisms can still block it and prevent you from scraping at scale.
To bypass all anti-bot detection, we recommend integrating ZenRows, an all-in-one web scraping solution, into your web scraper. Try ZenRows for free!