How to Use a Proxy with BeautifulSoup in 2024

May 23, 2024 · 6 min read

Python offers powerful libraries such as BeautifulSoup for parsing and Requests for scraping, but you're likely to get blocked because of restrictions such as IP banning and rate limiting. So in this tutorial, you'll learn to implement a BeautifulSoup proxy to avoid getting blocked.

Ready? Let's dive in!

First Steps with BeautifulSoup and Python Requests

For this example of scraping with BeautifulSoup and Python Requests, we'll scrape products from ScrapingCourse.com, a demo website with e-commerce features.

ScrapingCourse.com Ecommerce homepage

As a prerequisite, install BeautifulSoup and Requests using the following command:

Terminal
pip install beautifulsoup4 requests

Then, import the required modules.

scraper.py
import requests
from bs4 import BeautifulSoup

Now, send a GET request to the target URL, retrieve the web server's response, and save it in a variable.

scraper.py
url = "https://www.scrapingcourse.com/ecommerce/"
response = requests.get(url)
content = response.content

Let's print the response so we can see the full HTML we'll extract data from. Here's the complete code:

scraper.py
import requests
from bs4 import BeautifulSoup

url = "https://www.scrapingcourse.com/ecommerce/"
response = requests.get(url)
content = response.content

print(content)

And this is the output:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!--- ... --->
  
    <title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
    
  <!--- ... --->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">

        <!--- ... --->

    </ul>
</body>
</html>

The raw HTML above isn't useful as is. To extract the products, we'll parse the HTML stored in the content variable using BeautifulSoup, which lets us navigate the HTML structure and retrieve the product names.

To extract only the product names, create a BeautifulSoup object to parse the content variable.

scraper.py
soup = BeautifulSoup(content, "html.parser")

Next, inspect the target page in your browser's DevTools to locate the HTML elements that contain the products:

scrapingcourse ecommerce homepage inspect first product li

From the image above, each product sits in a list item (<li>) tag. Right-click the <li> element and copy its CSS selector (.product).

After that, use BeautifulSoup's select() method, which finds elements by CSS selector, to locate the <li> elements that represent the products within the parsed content. Pass your CSS selector as an argument to select().

    scraper.py
    # select the product container
    products = soup.select(".product")
    

    To finish, use a list comprehension over the products list to extract the text of the <h2> inside each <li>. This retrieves the individual product names.

    scraper.py
    # extract the text content of the <h2> element inside each product <li>
    product_names = [product.find("h2").get_text() for product in products]
    
    scraper.py
    for name in product_names:
        print(name)
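
To try the selection and extraction steps without hitting the network, the sketch below runs them against a small inline HTML string. The markup is a hypothetical, trimmed-down stand-in for the real page (the actual product items carry more classes and child elements); only the .product class and the <h2> name element are taken from this tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the product list on the demo site
sample_html = """
<ul class="products columns-4">
  <li class="product"><h2>Abominable Hoodie</h2></li>
  <li class="product"><h2>Adrienne Trek Jacket</h2></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() returns a list-like collection of every element matching the CSS selector
products = soup.select(".product")

# one name per product: the stripped text of the <h2> inside each <li>
product_names = [product.find("h2").get_text(strip=True) for product in products]

for name in product_names:
    print(name)
```

Because the comprehension visits each product exactly once, the result contains one name per <li>, with no duplicates.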
    

    Putting everything together, here's the complete code.

    scraper.py
    import requests
    from bs4 import BeautifulSoup
    
    # define the URL of the website we want to scrape
    url = "https://www.scrapingcourse.com/ecommerce/"
    
    # send a GET request to the URL and retrieve the response
    response = requests.get(url)
    
    # extract the content of the response 
    content = response.content
    
    # create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(content, "html.parser")
    
    # select the product container
    products = soup.select(".product")
    
    # extract the text content of the <h2> element inside each product <li>
    product_names = [product.find("h2").get_text() for product in products]
    
    for name in product_names:
        print(name)
    

    Yielding the following result:

    Output
    Abominable Hoodie
    Adrienne Trek Jacket
    
    # ... other products omitted for brevity
    
    Ariel Roll Sleeve Sweatshirt
    Artemis Running Short
    

    Congrats, you now know how to scrape using BeautifulSoup and Python Requests.

    However, our target was a test website that allows scraping. Against real-world websites with anti-scraping restrictions, the script is likely to get blocked, so let's move on to using proxies with BeautifulSoup.


    How to Use a Proxy with BeautifulSoup and Python Requests

    Proxies allow you to make requests from different IP addresses. As an example, send a request to ident.me using the following code.

    scraper.py
    import requests
    
    url = "http://ident.me/"
    
    response = requests.get(url)
    ip_address = response.text
    
    print("Your IP address is:", ip_address)
    

    Your result should be your IP address.

    Output
    Your IP address is: 190.158.1.38
    

    Now, let's make the same request using a proxy. For this example, we'll take any IP from FreeProxyList. And to implement it, we'll specify the proxy details in the script. That way, you'll be making your request through the specified proxy server.

    Import Requests and set your proxy.

    scraper.py
    import requests
    
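    # Requests matches these keys against the scheme of the target URL,
    # so the http:// URL used below needs an "http" entry
    proxy = {
        "http": "http://91.25.93.174:3128",
        "https": "http://91.25.93.174:3128",
    }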
    

    Then, pass the proxy variable to the proxies parameter of the requests.get() method and print the response.

    scraper.py
    url = "http://ident.me/"
    
    response = requests.get(url, proxies=proxy)
    ip_address = response.text
    
    print("Your new IP address is:", ip_address)
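
Requests decides which proxy to use by matching the keys of the proxies dictionary against the scheme of the target URL. One way to check this offline is the select_proxy helper in requests.utils, which, as far as we can tell, performs the same lookup Requests uses internally. The sketch below makes no network call, and the proxy address is just the placeholder used in this tutorial.

```python
from requests.utils import select_proxy

url = "http://ident.me/"

# only an "https" key: nothing matches an http:// target, so no proxy is picked
https_only = {"https": "http://91.25.93.174:3128"}
print(select_proxy(url, https_only))

# both keys: the "http" entry matches the http:// target
both_schemes = {
    "http": "http://91.25.93.174:3128",
    "https": "http://91.25.93.174:3128",
}
print(select_proxy(url, both_schemes))
```

If the first call returns nothing for your dictionary, the request would go out directly from your own IP address, so it's worth defining both keys.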
    

    Putting it all together, you'll have the following complete code.

    scraper.py
    import requests
    
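    # Requests matches these keys against the scheme of the target URL,
    # so the http:// URL used below needs an "http" entry
    proxy = {
        "http": "http://91.25.93.174:3128",
        "https": "http://91.25.93.174:3128",
    }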
    
    url = "http://ident.me/"
    
    response = requests.get(url, proxies=proxy)
    ip_address = response.text
    
    print("Your new IP address is:", ip_address)
    

    Here's our result:

    Output
    Your new IP address is: 91.25.93.170
    

    Congrats, you've configured your first proxy with BeautifulSoup and Python Requests. The result above is the proxy server's IP address, meaning that the request was successfully routed through the specified proxy.
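
Free proxies sometimes return an error page instead of forwarding the request, so it can help to verify that the response body really is an IP address and that it differs from your own. The sketch below does this with the standard library's ipaddress module. The helper names are hypothetical, and the sample values are the ones shown in this tutorial's outputs.

```python
import ipaddress


def looks_like_ip(text):
    """Return True if text (ignoring surrounding whitespace) is a valid IP address."""
    try:
        ipaddress.ip_address(text.strip())
        return True
    except ValueError:
        return False


def routed_through_proxy(direct_ip, proxied_ip):
    """Return True if both values are IP addresses and the proxied one differs."""
    if not (looks_like_ip(direct_ip) and looks_like_ip(proxied_ip)):
        return False
    direct = ipaddress.ip_address(direct_ip.strip())
    proxied = ipaddress.ip_address(proxied_ip.strip())
    return direct != proxied


# a different IP came back: the request did not leave from our own address
print(routed_through_proxy("190.158.1.38", "91.25.93.170\n"))

# an HTML error page came back: treat it as a failed proxy
print(routed_through_proxy("190.158.1.38", "<html>503 Service Unavailable</html>"))
```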

    However, websites often implement measures like rate limiting and IP banning. Therefore, you must rotate proxies to avoid getting flagged.

    To rotate proxies with BeautifulSoup and Python Requests, start by defining a proxy list. Once more, we've obtained a list from FreeProxyList for this example.

    scraper.py
    import requests
    
    # list of proxies
    proxies = [
        "http://46.16.201.51:3129",
        "http://207.2.120.19:80",
        "http://50.227.121.35:80",
        # add more proxies as needed
    ]
    

    After that, iterate over each proxy in the proxies list, make a GET request using the current proxy, and print the response.

    scraper.py
    for proxy in proxies:
        try:
            # make a GET request to the specified URL using the current proxy
            response = requests.get(url, proxies={"http": proxy, "https": proxy})
            
            # extract the IP address from the response content
            ip_address = response.text
            
            # print the obtained IP address
            print("Your IP address is:", ip_address)
    
        except requests.exceptions.RequestException as e:
            print(f"Request failed with proxy {proxy}: {str(e)}")
            continue  # move to the next proxy if the request fails
    

    Putting everything together, here's the complete code.

    scraper.py
    import requests
    
    # list of proxies
    proxies = [
        "http://46.16.201.51:3129",
        "http://207.2.120.19:80",
        "http://50.227.121.35:80",
        # add more proxies as needed
    ]
    
    url = "http://ident.me/"
    
    for proxy in proxies:
        try:
            # make a GET request to the specified URL using the current proxy
            response = requests.get(url, proxies={"http": proxy, "https": proxy})
            
            # extract the IP address from the response content
            ip_address = response.text
            
            # print the obtained IP address
            print("Your IP address is:", ip_address)
    
        except requests.exceptions.RequestException as e:
            print(f"Request failed with proxy {proxy}: {str(e)}")
            continue  # move to the next proxy if the request fails
    

    Here's our result:

    Output
    Your IP address is: 46.16.201.51
    Your IP address is: 207.2.120.19
    Your IP address is: 50.227.121.35
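
The loop above tries every proxy once. In a scraper, you usually want one successful response per URL instead, moving to the next proxy only when the current one fails. Here is a minimal sketch of that variant, assuming a round-robin order via itertools.cycle and a 10-second timeout. The function names and default values are hypothetical choices, not part of Requests, and the proxy addresses are the placeholders from this tutorial.

```python
import itertools

import requests

# Placeholder proxies from this tutorial; free proxies expire quickly
PROXIES = [
    "http://46.16.201.51:3129",
    "http://207.2.120.19:80",
    "http://50.227.121.35:80",
]

# cycle() yields the proxies in order and starts over after the last one
proxy_pool = itertools.cycle(PROXIES)


def next_proxy_config():
    """Return a Requests-style proxies dict for the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}


def fetch_with_rotation(url, max_attempts=3, timeout=10):
    """Try up to max_attempts proxies and return the first successful response."""
    last_error = None
    for _ in range(max_attempts):
        proxy_config = next_proxy_config()
        try:
            response = requests.get(url, proxies=proxy_config, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as error:
            # remember the failure and move on to the next proxy
            last_error = error
    raise RuntimeError(f"All {max_attempts} attempts failed") from last_error
```

Calling fetch_with_rotation("http://ident.me/") would then return the first response that comes back without an HTTP error. The explicit timeout keeps a dead proxy from stalling the script indefinitely.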
    

    Awesome, right? Now you know how to configure a BeautifulSoup proxy and also how to rotate proxies to avoid getting blocked.

    That said, there's a lot more on this topic than that. Check out our guide on how to use a proxy with Python Requests to learn more.

    Also, bear in mind free proxies are unreliable and often fail in practical use cases. We only used them in this example to show you the basics. For example, if you replace ident.me with OpenSea, you'll get error messages, as seen below.

    Output
    Request failed with proxy http://46.16.201.51:3129: HTTPSConnectionPool(host='opensea.io', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x0000027FCC4BA710>, 'Connection to 46.16.201.51 timed out. (connect timeout=None)'))
    
    Request failed with proxy http://207.2.120.19:80: HTTPSConnectionPool(host='opensea.io', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 503 Service Temporarily Unavailable')))
    
    Request failed with proxy http://50.227.121.35:80: HTTPSConnectionPool(host='opensea.io', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x0000027FCC4C0850>, 'Connection to 50.227.121.35 timed out. (connect timeout=None)'))
    

    Fortunately, premium proxies yield better results. Let's use them next.

    Premium Proxy to Avoid Getting Blocked with BeautifulSoup

    Premium proxies for scraping used to be expensive, particularly for large-scale use cases. However, solutions like ZenRows have made them more accessible. Besides premium proxies, ZenRows comes with features that help you avoid getting blocked, such as JavaScript rendering, header rotation, and advanced anti-bot bypass measures.

    To try ZenRows, sign up to get your free API key. You'll get to the Request Builder, where you have to input your target URL (https://www.opensea.io/), check the boxes for premium proxies, anti-bot, and JavaScript rendering, and select Python.

    ZenRows Request Builder

    That will generate the code on the right. Copy it to your IDE to send your request using Python Requests.

    Your new script should look like this.

    scraper.py
    import requests
    
    url = "https://opensea.io/"
    apikey = "Your API Key"
    params = {
        "url": url,
        "apikey": apikey,
        "js_render": "true",
        "antibot": "true",
        "premium_proxy": "true",
    }
    response = requests.get("https://api.zenrows.com/v1/", params=params)
    print(response.text)
    

    Here's the result:

    Output
    <!DOCTYPE html>
    //..
    <title>OpenSea, the largest NFT marketplace</title>
    

    Bingo! You're all set. ZenRows makes web scraping super easy.
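
HTML fetched through a proxy or an API can be parsed with BeautifulSoup exactly like a direct response. The sketch below uses a hard-coded sample based on the output above in place of response.text, so it runs offline. The helper name extract_title is hypothetical.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text, trimmed to the part shown in the output
sample_html = """<!DOCTYPE html>
<html>
<head><title>OpenSea, the largest NFT marketplace</title></head>
<body></body>
</html>"""


def extract_title(html):
    """Return the page title, or None if the document has no <title> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


print(extract_title(sample_html))
```

In the real script, you would pass response.text to the helper instead of the sample string.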

    Conclusion

    Proxies act as intermediaries between your web scraper and the target web server. By routing your requests through proxies, you can access websites from different IP addresses and bypass restrictions such as IP bans and rate limits. However, free proxies are unreliable and often fail in real-world cases.

    A great option is using a Python Requests and BeautifulSoup proxy with ZenRows for effective and scalable web scraping. Sign up now and enjoy 1,000 free API credits.
