MechanicalSoup is a Python library for automating website interactions. It's built on top of BeautifulSoup and Requests, popular tools in the web scraping community.
While MechanicalSoup is effective for web scraping with Python, it doesn't prevent your scraper from getting blocked by websites' anti-bot systems.
Fortunately, there are a few methods to boost MechanicalSoup's anti-detection capabilities. In this tutorial, you'll learn step by step how to set up proxies in MechanicalSoup.
Set Up a Single Proxy With MechanicalSoup
As a prerequisite, install the MechanicalSoup library if you haven't already:
pip install MechanicalSoup
Before setting up the proxy, let's make a simple HTTP GET request to https://httpbin.io/ip. This website returns the IP address of the client making the request.
Import the mechanicalsoup module into your code and create a browser object. Next, send a GET request to the target URL and print the response text.
import mechanicalsoup
# create a browser object
browser = mechanicalsoup.StatefulBrowser()
# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip")
print(response.text)
The code will print your machine's IP address:
{
  "origin": "50.173.55.144:30127"
}
Exposing your IP address is not a good idea, as websites may block it due to scraping activities.
Let's set up a proxy to reduce the chances of being detected and blocked.
Start by grabbing a free proxy from the Free Proxy List website.
The proxies used in this tutorial may not work at the time of reading. Free proxies have a short lifespan and are only suitable for learning purposes. To follow along, grab fresh proxies from the Free Proxy List.
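Since free proxies die quickly, it can save you debugging time to check that a proxy still responds before wiring it into your scraper. Here's a minimal sketch of such a check (the is_proxy_alive() helper is our own, not part of MechanicalSoup):
import requests
# return True if a test request through the proxy succeeds
def is_proxy_alive(proxy_url, timeout=5):
    proxies = {"https": proxy_url, "http": proxy_url}
    try:
        response = requests.get("https://httpbin.io/ip", proxies=proxies, timeout=timeout)
        return response.ok
    except requests.RequestException:
        return False
# example usage: skip a proxy that no longer works
if not is_proxy_alive("http://8.219.97.248:80"):
    print("proxy is dead, grab a fresh one")
Requests is already available in your environment because MechanicalSoup depends on it.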
Next, define a proxy in your code pointing to the IP address and port number (e.g., http://8.219.97.248:80). This ensures that HTTP and HTTPS requests are routed through this proxy.
# define proxies using this syntax:
# <PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>
proxies = {
    "https": "http://8.219.97.248:80",
    "http": "http://8.219.97.248:80",
}
Finally, pass the proxies dictionary to the request you make through the browser's session. Here's what your code should look like:
import mechanicalsoup
# define proxies using this syntax:
# <PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>
proxies = {
    "https": "http://8.219.97.248:80",
    "http": "http://8.219.97.248:80",
}
# create a browser object
browser = mechanicalsoup.StatefulBrowser()
# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip", proxies=proxies)
print(response.text)
The code will output the IP address of the proxy server in use:
{
  "origin": "8.219.64.236:60924"
}
Congrats! You've just changed the IP address of your MechanicalSoup scraper. Let's move to more advanced concepts.
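Before moving on, a quick tip: browser.session is a regular Requests Session, so instead of passing proxies on every call, you can attach them to the session once. A minimal sketch of that variant:
import mechanicalsoup
proxies = {
    "https": "http://8.219.97.248:80",
    "http": "http://8.219.97.248:80",
}
# create a browser object
browser = mechanicalsoup.StatefulBrowser()
# attach the proxies to the underlying Requests session once
browser.session.proxies.update(proxies)
# every request made through this session now uses the proxy
response = browser.session.get("https://httpbin.io/ip")
print(response.text)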
Proxy Authentication
Some proxy servers require authentication to grant access only to users with valid credentials. That's usually the case with commercial solutions and premium proxies.
Here's the syntax to specify credentials (username and password) for an authenticated proxy:
<PROXY_PROTOCOL>://<USERNAME>:<PASSWORD>@<PROXY_IP_ADDRESS>:<PROXY_PORT>
This is what your updated code with proxy authentication should look like:
import mechanicalsoup
# define proxies
proxies = {
    "https": "http://<YOUR_USERNAME>:<YOUR_PASSWORD>@72.10.160.173:3985",
    "http": "http://<YOUR_USERNAME>:<YOUR_PASSWORD>@72.10.160.173:3985",
}
# create a browser object
browser = mechanicalsoup.StatefulBrowser()
# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip", proxies=proxies)
print(response.text)
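One gotcha worth knowing: if your username or password contains special characters such as @ or :, they'll break the proxy URL. You can percent-encode them with urllib.parse.quote from the standard library. A quick sketch (the credentials are placeholders):
from urllib.parse import quote
# percent-encode credentials so characters like "@" or ":" don't break the URL
username = quote("<YOUR_USERNAME>", safe="")
password = quote("<YOUR_PASSWORD>", safe="")
proxy = f"http://{username}:{password}@72.10.160.173:3985"
proxies = {"https": proxy, "http": proxy}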
Add Rotating and Premium Proxies to MechanicalSoup
If you make multiple requests in a short period using a single proxy, the websites you're trying to access can detect this behavior and block you.
To avoid getting blocked, you can use a rotating proxy. This means changing proxies after a certain amount of time or number of requests, making you appear as a different user each time.
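If you'd rather rotate on a fixed schedule, for example switching to the next proxy every few requests, you can combine a counter with itertools.cycle. Here's an illustrative sketch of that variant (the interval of 5 is an arbitrary choice of ours):
import itertools
import mechanicalsoup
# example proxy list; grab fresh ones from the Free Proxy List
PROXIES = [
    "http://8.219.97.248:80",
    "http://148.72.140.24:30127",
    "http://77.238.235.219:8080",
]
REQUESTS_PER_PROXY = 5  # arbitrary interval, tune to your needs
proxy_cycle = itertools.cycle(PROXIES)
browser = mechanicalsoup.StatefulBrowser()
for i in range(15):
    # move to the next proxy every REQUESTS_PER_PROXY requests
    if i % REQUESTS_PER_PROXY == 0:
        proxy = next(proxy_cycle)
        proxies = {"https": proxy, "http": proxy}
    response = browser.session.request("get", "https://httpbin.io/ip", proxies=proxies)
    print(response.text)
For the rest of this section, we'll take the simpler route of picking a random proxy per request.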
Create a list of proxies using the same Free Proxy List website:
# create a list of proxies
PROXIES = [
    "http://8.219.97.248:80",
    "http://148.72.140.24:30127",
    # ...
    "http://77.238.235.219:8080"
]
Next, create a function that randomly selects a proxy from the list and returns it as a proxies dictionary. You can use the random.choice() method for this. Note that random.choice() is called separately for HTTP and HTTPS, so the two protocols may end up routed through different proxies.
# ...
import random
# ...
# function to randomly select and return proxies
def rotate_proxy():
    https_proxy = random.choice(PROXIES)
    http_proxy = random.choice(PROXIES)
    return {
        "https": https_proxy,
        "http": http_proxy
    }
# ...
# rotate proxies
proxies = rotate_proxy()
# ...
Here's your final rotating proxy code:
import mechanicalsoup
import random
# create a list of proxies
PROXIES = [
    "http://8.219.97.248:80",
    "http://148.72.140.24:30127",
    # ...
    "http://77.238.235.219:8080"
]
# function to randomly select and return proxies
def rotate_proxy():
    https_proxy = random.choice(PROXIES)
    http_proxy = random.choice(PROXIES)
    return {
        "https": https_proxy,
        "http": http_proxy
    }
# create a browser object
browser = mechanicalsoup.StatefulBrowser()
# rotate proxies
proxies = rotate_proxy()
# send a GET request
response = browser.session.request("get", "https://httpbin.io/ip", proxies=proxies)
print(response.text)
Each time you run this code, the script randomly picks a proxy from the list.
# request 1
{
  "origin": "8.219.64.236:64632"
}
# request 2
{
  "origin": "77.238.235.219:8080"
}
# request 3
{
  "origin": "148.72.140.24:30127"
}
Congratulations! You've successfully implemented the rotating proxies functionality.
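In practice, free proxies fail often, so it's worth retrying a failed request with a fresh proxy. Here's a quick sketch that builds on the rotate_proxy() function and browser object above (the fetch_with_rotation() helper and the retry budget are our own additions, not part of MechanicalSoup):
import requests
MAX_RETRIES = 3  # assumed retry budget
# reuses rotate_proxy() and browser from the snippet above
def fetch_with_rotation(url):
    for _ in range(MAX_RETRIES):
        proxies = rotate_proxy()
        try:
            return browser.session.request("get", url, proxies=proxies, timeout=10)
        except requests.RequestException:
            continue  # dead or slow proxy, rotate and retry
    raise RuntimeError("all proxy attempts failed")
response = fetch_with_rotation("https://httpbin.io/ip")
print(response.text)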
However, while free proxies work well for the sake of this exercise, it's not advisable to use them in production. Not only are they short-lived, but in most cases they also won't protect you from websites' anti-bot protection systems.
Let's try sending a request to a protected website like G2 Reviews using the above approach.
import mechanicalsoup
import random
# create a list of proxies
PROXIES = [
    "http://31.186.239.245:8080",
    "http://5.78.50.231:8888",
    # ...
    "http://52.4.247.252:8129"
]
# function to randomly select and return proxies
def rotate_proxy():
    https_proxy = random.choice(PROXIES)
    http_proxy = random.choice(PROXIES)
    return {
        "https": https_proxy,
        "http": http_proxy
    }
# create a browser object
browser = mechanicalsoup.StatefulBrowser()
# rotate proxies
proxies = rotate_proxy()
# send a GET request
response = browser.session.request("get", "https://www.g2.com/products/asana/reviews", proxies=proxies)
print(response.status_code)
The server will respond with a 403 status code:
403
This means G2 Reviews detected your rotating proxy request as a bot and responded with a 403 Forbidden error.
So, what's the solution? Use premium proxies!
Premium proxies provide a complete automated process for anti-bot bypassing. You can scrape more efficiently and use precise geolocation while remaining anonymous. If you need help deciding which service to choose, check out our list of the best premium proxy providers.
Let's use the recommended solution, ZenRows, to access the protected G2 Reviews website that previously blocked us.
To get started with ZenRows, sign up for free. After logging in, you'll be redirected to the Requests Builder page. Paste the G2 Reviews URL into the URL to Scrape box, enable JS Rendering, and tick the Premium Proxies checkbox. Select Python as your language and click the Proxy tab. Finally, copy the generated premium proxy.
You need to modify your previous code to integrate ZenRows. Here's the final code to access a protected page using ZenRows premium proxies and MechanicalSoup:
import mechanicalsoup
# paste your generated ZenRows premium proxy here
proxy = "http://<YOUR_ZENROWS_API_KEY>:js_render=true&premium_proxy=true@proxy.zenrows.com:8001"
proxies = {
    "https": proxy,
    "http": proxy
}
# create a browser object
browser = mechanicalsoup.StatefulBrowser()
# send a GET request
response = browser.session.request("get", "https://www.g2.com/products/asana/reviews", proxies=proxies, verify=False)
print(response.status_code)
Passing the verify=False parameter is mandatory when using premium proxies in ZenRows.
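Note that disabling certificate verification makes Requests emit an InsecureRequestWarning on every call. If the noise bothers you, you can silence it through urllib3, which ships as a Requests dependency:
import urllib3
# silence the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)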
When you run this code, you'll get the 200 status code as output:
200
The above output confirms you successfully bypassed anti-bot protection with ZenRows premium proxies. Congratulations!
Conclusion
This step-by-step tutorial showed how to set up a proxy in MechanicalSoup.
Now you know:
- The basics of setting up a proxy with MechanicalSoup in Python.
- How to deal with proxy authentication.
- How to use a rotating proxy.
- How to implement a premium proxy and bypass anti-bot systems.
Avoid the hassle of finding and configuring proxies. Using ZenRows, you can bypass any anti-bot protection and increase the reliability of your scraper. Try ZenRows for free!