How to Set Up a Proxy With MechanicalSoup

July 3, 2024 · 8 min read

MechanicalSoup is a Python library for automating website interactions. It's built on top of BeautifulSoup and Requests, popular tools in the web scraping community.

While MechanicalSoup is effective for web scraping with Python, it doesn't prevent your scraper from getting blocked by websites' anti-bot systems.

Fortunately, there are a few ways to make MechanicalSoup harder to detect. In this tutorial, you'll learn step by step how to set up proxies in MechanicalSoup.

Set Up a Single Proxy With MechanicalSoup

As a prerequisite, install the MechanicalSoup library if you haven't already:

Terminal
pip install MechanicalSoup

Before setting up the proxy, let's make a simple HTTP GET request to https://httpbin.io/ip. This website returns the IP address of the client making the request.

Import the mechanicalsoup module into your code and create a browser object. Next, send a GET request to the target URL and print the response text.

scraper.py
import mechanicalsoup

# create a browser object
browser = mechanicalsoup.StatefulBrowser()

# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip")

print(response.text)

The code will print your machine's IP address:

Output
{
  "origin": "50.173.55.144:30127"
}

Exposing your IP address is not a good idea, as websites may block it due to scraping activities.

Let's set up a proxy to reduce the chances of being detected and blocked.

Start by grabbing a free proxy from the Free Proxy List website.

Next, define a proxy in your code pointing to the IP address and port number (e.g., http://8.219.97.248:80). Map both the http and https keys to it so that HTTP and HTTPS requests are routed through this proxy.

scraper.py
# define proxies using this syntax:
# <PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>
proxies = {
    "https": "http://8.219.97.248:80",
    "http": "http://8.219.97.248:80",
}

Finally, pass the proxies dictionary to the request call through its proxies argument. Here's what your code should look like:

scraper.py
import mechanicalsoup

# define proxies using this syntax:
# <PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>
proxies = {
    "https": "http://8.219.97.248:80",
    "http": "http://8.219.97.248:80",
}

# create a browser object
browser = mechanicalsoup.StatefulBrowser()

# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip", proxies=proxies)

print(response.text)

The code will output the IP address of the proxy server in use:

Output
{
  "origin": "8.219.64.236:60924"
}
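
To confirm programmatically that the proxy is in effect, you can compare the address httpbin reports with and without the proxy. The helper below is a minimal sketch: origin_ip is a hypothetical name, and it assumes the response body has the "origin" field shown in the outputs above, holding an IPv4 address with an optional ":port" suffix.

```python
import json


def origin_ip(response_text):
    """Return the IP part of the "origin" field in an httpbin.io/ip body."""
    origin = json.loads(response_text)["origin"]
    # drop the trailing ":port" if present (assumes an IPv4 address)
    return origin.rsplit(":", 1)[0]


# sample bodies copied from the two outputs shown in this tutorial
direct = origin_ip('{"origin": "50.173.55.144:30127"}')
proxied = origin_ip('{"origin": "8.219.64.236:60924"}')

# the two addresses differ when the request really went through the proxy
print(direct != proxied)
```

In the scraper itself, you would pass response.text from each request to origin_ip instead of the hard-coded sample strings.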

Congrats! You've just changed the IP address of your MechanicalSoup scraper. Let's move to more advanced concepts.
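
One optional refinement before moving on: the proxies argument applies only to the single call it is passed to. Since browser.session is a Requests session, setting its proxies attribute should route every request the browser makes, including browser.open() and form submissions, through the proxy. A minimal sketch, where use_proxy is a hypothetical helper name:

```python
def use_proxy(session, proxy_url):
    """Route all HTTP and HTTPS traffic of a Requests-style session via proxy_url."""
    # session.proxies is a plain dict on requests.Session objects,
    # which is what browser.session is in MechanicalSoup
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session
```

With MechanicalSoup, you would call use_proxy(browser.session, "http://8.219.97.248:80") once after creating the browser. Keep in mind that Requests lets proxy environment variables such as HTTPS_PROXY override session-level proxies, so the per-request proxies argument remains the more predictable option if those variables are set.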


Proxy Authentication

Some proxy servers require authentication so that only users with valid credentials can access them. This is usually the case with commercial solutions or premium proxies.

Here's the syntax to specify credentials (username and password) for an authenticated proxy:

Example
<PROXY_PROTOCOL>://<USERNAME>:<PASSWORD>@<PROXY_IP_ADDRESS>:<PROXY_PORT>

This is what your updated code with proxy authentication should look like:

scraper.py
import mechanicalsoup

# define proxies
proxies = {
    "https": "http://<YOUR_USERNAME>:<YOUR_PASSWORD>@72.10.160.173:3985",
    "http": "http://<YOUR_USERNAME>:<YOUR_PASSWORD>@72.10.160.173:3985",
}

# create a browser object
browser = mechanicalsoup.StatefulBrowser()

# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip", proxies=proxies)

print(response.text)
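
Credentials that contain characters such as @, : or / would break the URL syntax above unless they are percent-encoded. The sketch below builds the proxy string with urllib.parse.quote from the standard library. The build_proxy_url helper and the sample credentials are hypothetical and used for illustration only.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Return <scheme>://<username>:<password>@<host>:<port> with encoded credentials."""
    # safe="" makes quote() escape every reserved character, including "/"
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


# made-up credentials; the host and port are the ones used above
proxy = build_proxy_url("scraper", "p@ss:word", "72.10.160.173", 3985)
proxies = {"https": proxy, "http": proxy}

print(proxy)
```

The resulting proxies dictionary can be passed to browser.session.request() exactly as in the example above.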

Add Rotating and Premium Proxies to MechanicalSoup

If you make multiple requests in a short period using a single proxy, the websites you're trying to access can detect this behavior and block you.

To avoid getting blocked, you can use a rotating proxy. This means changing proxies after a certain amount of time or number of requests, making you appear as a different user each time.

Create a list of proxies using the same Free Proxy List website:

Example
# create a list of proxies
PROXIES = [
    "http://8.219.97.248:80",
    "http://148.72.140.24:30127",
    # ...
    "http://77.238.235.219:8080"
]

Next, create a function that randomly selects proxies from the list and returns them as a dictionary. You can use the random.choice() function for this.

scraper.py
# ...
import random

# ...

# function to randomly select and return proxies
def rotate_proxy():
    https_proxy = random.choice(PROXIES)
    http_proxy = random.choice(PROXIES)

    return {
        "https": https_proxy,
        "http": http_proxy,
    }

# ...

# rotate proxies
proxies = rotate_proxy()

# ...

Here's your final rotating proxy code:

scraper.py
import mechanicalsoup
import random

# create a list of proxies
PROXIES = [
    "http://8.219.97.248:80",
    "http://148.72.140.24:30127",
    # ...
    "http://77.238.235.219:8080"
]

# function to randomly select and return proxies
def rotate_proxy():
    https_proxy = random.choice(PROXIES)
    http_proxy = random.choice(PROXIES)

    return {
        "https": https_proxy,
        "http": http_proxy,
    }

# create a browser object
browser = mechanicalsoup.StatefulBrowser()

# rotate proxies
proxies = rotate_proxy()

# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip", proxies=proxies)

print(response.text)

Each time you run this code, the script randomly picks a proxy from the list.

Output
# request 1
{
  "origin": "8.219.64.236:64632"
}

# request 2
{
  "origin": "77.238.235.219:8080"
}

# request 3
{
  "origin": "148.72.140.24:30127"
}

Congratulations! You've successfully implemented the rotating proxies functionality.
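
Note that the script above picks its proxies once per run and may pair different proxies for HTTP and HTTPS. If you send several requests from one process, an alternative is to cycle through the list so that each request uses the next proxy. The following is a sketch of a round-robin variant (not the random selection shown above), with proxy_cycle and next_proxies as hypothetical names:

```python
from itertools import cycle

PROXIES = [
    "http://8.219.97.248:80",
    "http://148.72.140.24:30127",
    "http://77.238.235.219:8080",
]

# endless iterator that restarts from the first proxy after the last one
proxy_cycle = cycle(PROXIES)


def next_proxies():
    """Return a proxies dict that uses the next proxy for both schemes."""
    proxy = next(proxy_cycle)
    return {"https": proxy, "http": proxy}
```

Call next_proxies() right before each browser.session.request() call and pass the result as the proxies argument.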

However, while free proxies work well for the sake of this exercise, it's not advisable to use them in production. Not only are they short-lived, but in most cases they also won't protect you from websites' anti-bot protection systems.
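
Because free proxies fail often, a scraper usually needs to retry with another proxy when one stops responding. The following is a minimal sketch of that idea and not part of MechanicalSoup: fetch_with_failover is a hypothetical helper, and fetch is any function you supply that takes a proxies dictionary and either returns a response or raises an exception.

```python
import random


def fetch_with_failover(fetch, proxy_urls, max_attempts=3):
    """Try up to max_attempts different proxies, returning the first success."""
    # sample without replacement so a dead proxy is not tried twice
    candidates = random.sample(proxy_urls, k=min(max_attempts, len(proxy_urls)))
    last_error = None
    for proxy in candidates:
        try:
            return fetch({"https": proxy, "http": proxy})
        except Exception as error:  # narrow this to your HTTP library's errors
            last_error = error
    raise RuntimeError("all proxies failed") from last_error
```

With MechanicalSoup, fetch could be lambda proxies: browser.session.request("get", url, proxies=proxies, timeout=10). The timeout value is an assumption, chosen so that a dead proxy fails quickly instead of hanging.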

Let's try sending a request to a protected website like G2 Reviews using the above approach.

scraper.py
import mechanicalsoup
import random

# create a list of proxies
PROXIES = [
    "http://31.186.239.245:8080",
    "http://5.78.50.231:8888",
    # ...
    "http://52.4.247.252:8129"
]

# function to randomly select and return proxies
def rotate_proxy():
    https_proxy = random.choice(PROXIES)
    http_proxy = random.choice(PROXIES)

    return {
        "https": https_proxy,
        "http": http_proxy,
    }

# create a browser object
browser = mechanicalsoup.StatefulBrowser()

# rotate proxies
proxies = rotate_proxy()

# send a GET request
response = browser.session.request("get", "https://www.g2.com/products/asana/reviews", proxies=proxies)

print(response.status_code)

The server will respond with a 403 status code:

Output
403

This means G2 Reviews detected your rotating proxy request as a bot and responded with a 403 Forbidden error.

So, what's the solution? Use premium proxies!

Premium proxy services typically offer more reliable IP addresses than free lists, handle rotation for you, and let you choose precise geolocations, so you can scrape more efficiently while remaining anonymous. If you need help deciding which service to choose, check out our list of the best premium proxy providers.

Let's use the recommended solution, ZenRows, to access the protected G2 Reviews website that previously blocked us.

To get started with ZenRows, sign up for free. After logging in, you'll be redirected to the Request Builder page. Paste the G2 Reviews URL in the URL to Scrape box. Enable JS Rendering and check the Premium Proxies box. Select Python as your language and click on the Proxy tab. Finally, copy the generated premium proxy.

ZenRows Request Builder

You need to modify your previous code to integrate ZenRows. Here's the final code to access a protected page using ZenRows premium proxies and MechanicalSoup:

scraper.py
import mechanicalsoup

# paste your generated ZenRows premium proxy here
proxy = "http://<YOUR_ZENROWS_API_KEY>:js_render=true&premium_proxy=true@proxy.zenrows.com:8001"
proxies = {
    "https": proxy,
    "http": proxy
}

# create a browser object
browser = mechanicalsoup.StatefulBrowser()

# send a GET request
# note: verify=False disables TLS certificate verification for this request
response = browser.session.request("get", "https://www.g2.com/products/asana/reviews", proxies=proxies, verify=False)

print(response.status_code)

When you run this code, you'll get a 200 status code as output.

Output
200

The above output confirms you successfully bypassed anti-bot protection with ZenRows premium proxies. Congratulations!
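
Hard-coding the API key in scraper.py makes it easy to leak through version control. One option, sketched below, is to keep the generated proxy string as a template and fill in the key from an environment variable at runtime. The variable name ZENROWS_API_KEY and the helper fill_api_key are assumptions for illustration, not ZenRows conventions.

```python
import os

# placeholder text as it appears in the proxy string above
PLACEHOLDER = "<YOUR_ZENROWS_API_KEY>"


def fill_api_key(proxy_template, env_var="ZENROWS_API_KEY"):
    """Replace the placeholder in proxy_template with the key stored in env_var."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"set the {env_var} environment variable first")
    return proxy_template.replace(PLACEHOLDER, api_key)
```

Pass in the proxy string copied from the Request Builder, with your key swapped back for the placeholder, and use the returned value for both entries of the proxies dictionary.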

Conclusion

This step-by-step tutorial showed how to set up a proxy in MechanicalSoup.

Now you know:

  • The basics of setting a proxy with MechanicalSoup in Python.
  • How to deal with proxy authentication.
  • How to use a rotating proxy.
  • How to implement a premium proxy and bypass anti-bot systems.

Avoid the hassle of finding and configuring proxies. Using ZenRows, you can bypass any anti-bot protection and increase the reliability of your scraper. Try ZenRows for free!
