Getting blocked while web scraping can be frustrating, and one of the first things to fix is the User Agent in Wget. So, let's learn how to change it.
What Is the Wget User Agent?
The User Agent in Wget is one of the HTTP headers sent along with every request. These HTTP request headers are metadata that give the web server additional information, such as caching preferences, session details, and web client capabilities.
Most importantly, the User Agent (UA) provides details about the web client, such as its name, version, and operating system. Here's a sample Google Chrome browser UA string:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36
It tells the web server that the request comes from a Chrome browser with version 92.0.4515.159, running on Windows 10, among other details.
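To make those components concrete, here is a minimal Python sketch that pulls the Chrome version and the platform details out of the sample string. The function name `describe_user_agent` and both regular expressions are illustrative assumptions, not a complete User Agent parser.

```python
import re

# Illustrative pattern: matches the "Chrome/<version>" product token.
CHROME_TOKEN = re.compile(r"Chrome/(\d+(?:\.\d+)*)")
# The first parenthesized group usually carries the platform details.
PLATFORM_GROUP = re.compile(r"\(([^)]*)\)")


def describe_user_agent(ua: str) -> dict:
    """Return the Chrome version and platform details found in a UA string."""
    chrome = CHROME_TOKEN.search(ua)
    platform = PLATFORM_GROUP.search(ua)
    return {
        "chrome_version": chrome.group(1) if chrome else None,
        "platform": platform.group(1) if platform else None,
    }


sample = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
)
print(describe_user_agent(sample))
```

A string without a Chrome token, such as Wget's default UA, simply yields `None` for the version.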
However, your default Wget User Agent typically looks like this:
Wget/1.21.4
You can check your Wget version, which is what the default UA reports, using the following command:
wget --version
From the above UAs, you can understand how easily websites can distinguish between Wget requests and an actual browser. That's why you need to set a custom Wget User Agent.
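As a rough illustration of how simple such a check can be on the server side, the sketch below flags any UA that starts with a command-line client's product token. The prefix list and the function name `looks_like_cli_client` are hypothetical; real anti-bot systems combine many more signals.

```python
# Hypothetical list of product-token prefixes a site might treat as
# non-browser clients. This is an assumption for illustration only.
CLI_PREFIXES = ("wget/", "curl/", "python-requests/")


def looks_like_cli_client(ua: str) -> bool:
    """Return True when the UA starts with a known command-line client token."""
    return ua.strip().lower().startswith(CLI_PREFIXES)


browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
)
print(looks_like_cli_client("Wget/1.21.4"))
print(looks_like_cli_client(browser_ua))
```

The default Wget UA is flagged while the Chrome string passes, which is why a custom UA is the first thing to change.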
How Do I Set a Custom User Agent in Wget?
Follow the steps below to change your User Agent in Wget.
Step 1: Customize UA
To override Wget's default UA, add the --user-agent (or -U) option, followed by the new UA, to your request.
To see it in action, let's use the sample Chrome UA shown above and target HTTPbin, which echoes back the User Agent string it receives.
$ wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36" "https://httpbin.io/user-agent"
Hit enter, and Wget will use the custom User Agent to make the request, retrieve the page content, and save it as user-agent in your project folder.
The user-agent file now contains your custom UA in JSON format.
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
}
Congrats, you've successfully changed your Wget User Agent to appear like a Chrome browser.
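If you'd rather confirm the change programmatically than open the file by hand, a small sketch like the following parses the JSON response and returns the UA the server saw. The function name `reported_user_agent` is an assumption for illustration, and the example feeds it the response text shown above rather than reading the saved file.

```python
import json


def reported_user_agent(response_text: str) -> str:
    """Extract the UA that HTTPbin echoed back in its JSON response."""
    return json.loads(response_text)["user-agent"]


# Response text as shown above. In practice, read it from the saved file:
#   reported_user_agent(open("user-agent", encoding="utf-8").read())
saved_response = """
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
}
"""
print(reported_user_agent(saved_response))
```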
However, using a single custom UA isn't enough. Keep reading to fix that.
Step 2: Use a Random User Agent in Wget
Randomizing your Wget User-Agent is critical to avoid getting blocked, especially when making many requests. Websites often flag "too many" requests as suspicious activity and can deny you access.
To work around this, you can send a random UA with each request, so the requests appear to the web server as though they come from different browsers (users).
To get started, create a text file in your project folder (user_agents.txt in this tutorial) containing a list of UAs, one per line. We've taken a few from our list of web scraping User Agents.
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
- Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
Next, use a scripting language, like Python, to select a random UA from the file and pass it to Wget.
For that, ensure you have Python installed and create a .py file.
Then, in your favorite editor, import the subprocess and random modules. The subprocess module runs external commands from within a Python script, while random picks a UA from the list at random. Next, point the script to your User Agent file and read its contents into a list variable (we've named this user_agents).
import subprocess
import random
# List of User Agents in a text file (one per line)
user_agent_file = "user_agents.txt"
# Read User Agents from the file
with open(user_agent_file, "r") as file:
    user_agents = file.read().splitlines()
Lastly, select a random UA from the list and pass it into the Wget command.
#...
# Choose a random User Agent
random_user_agent = random.choice(user_agents)
# Use wget with the random User Agent
url = "https://httpbin.io/user-agent" # Replace with your URL
command = f'wget --user-agent="{random_user_agent}" {url}'
subprocess.call(command, shell=True)
Putting it all together, here's the complete code:
import subprocess
import random
# List of User Agents in a text file (one per line)
user_agent_file = "user_agents.txt"
# Read User Agents from the file
with open(user_agent_file, "r") as file:
    user_agents = file.read().splitlines()
# Choose a random User Agent
random_user_agent = random.choice(user_agents)
# Use wget with the random User Agent
url = "https://httpbin.io/user-agent" # Replace with your URL
command = f'wget --user-agent="{random_user_agent}" {url}'
subprocess.call(command, shell=True)
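The script above builds a shell command string, so the UA has to survive shell quoting. As an alternative sketch, the same Wget call can be passed as an argument list, which skips the shell entirely. The helper names `build_wget_command` and `fetch_with_random_ua` are illustrative assumptions, not part of the original script.

```python
import random
import subprocess


def build_wget_command(user_agent: str, url: str) -> list[str]:
    """Build the argument list for a Wget call with a custom User Agent."""
    # Each list item reaches Wget as one argument, so no shell quoting is needed.
    return ["wget", f"--user-agent={user_agent}", url]


def fetch_with_random_ua(user_agents: list[str], url: str) -> int:
    """Run Wget once with a randomly chosen UA and return its exit code."""
    command = build_wget_command(random.choice(user_agents), url)
    return subprocess.run(command).returncode


# Inspect the command without running it.
print(build_wget_command("Example/1.0 (test; UA)", "https://httpbin.io/user-agent"))
```

Calling `fetch_with_random_ua(user_agents, url)` with the list read from user_agents.txt behaves like the script above, assuming Wget is installed and on your PATH.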
Run the Python script, and it will make a Wget request using a random UA from your list. Rerun it a few times, and you'll see Wget send different UAs across requests, although a random pick can occasionally repeat.
Here's our result for three requests:
{
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}
Bingo! You've rotated your first UAs with Wget.
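Instead of rerunning the script by hand, the rotation can also happen inside one run. The sketch below plans one Wget command per URL, each with its own random UA and its own output file set through the -O option. The function name `plan_requests`, the `seed` parameter, and the response_N.json file names are assumptions made for illustration.

```python
import random


def plan_requests(user_agents, urls, seed=None):
    """Return one Wget argument list per URL, each with a randomly picked UA."""
    rng = random.Random(seed)  # a seed makes the picks repeatable for testing
    commands = []
    for index, url in enumerate(urls, start=1):
        ua = rng.choice(user_agents)
        output_file = f"response_{index}.json"  # hypothetical naming scheme
        commands.append(["wget", f"--user-agent={ua}", "-O", output_file, url])
    return commands


user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
]
for command in plan_requests(user_agents, ["https://httpbin.io/user-agent"] * 3):
    print(command)
```

Each planned command can then be executed with `subprocess.run(command)`.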
To add more UAs to the list, you must ensure they're properly constructed to avoid detection. For example, if your UA claims to be a specific browser (e.g., Mozilla Firefox) but carries a version number or platform details that no real release of that browser uses, it can raise suspicion. Keeping your User Agents up to date is also critical, since outdated browser versions stand out.
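One part of that upkeep can be automated. The sketch below drops UAs whose Chrome major version falls under a chosen minimum. The function names and the threshold of 109 are hypothetical, and the check only understands Chrome tokens, so any UA without one is dropped as well.

```python
import re

# Captures the major version from a "Chrome/<major>." token.
CHROME_MAJOR = re.compile(r"Chrome/(\d+)\.")


def chrome_major(ua: str):
    """Return the Chrome major version as an int, or None if absent."""
    match = CHROME_MAJOR.search(ua)
    return int(match.group(1)) if match else None


def drop_stale(user_agents, min_major):
    """Keep only UAs whose Chrome major version is at least min_major."""
    kept = []
    for ua in user_agents:
        major = chrome_major(ua)
        if major is not None and major >= min_major:
            kept.append(ua)
    return kept


user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
]
print(drop_stale(user_agents, min_major=109))
```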
That said, building and maintaining a list of adequately constructed UAs can take time and effort. But no worries since the next section shows you the easiest solution.
Best Way to Change Wget User Agents at Scale and Avoid Getting Blocked
To make all your requests with a different and valid Wget User Agent by default, you can perform them via ZenRows' API. Also, it'll handle all other anti-bot measures for you, from CAPTCHAs to IP blocking, rate limiting, and many more.
Let's see ZenRows in action against a G2 product review page, which is well protected.
To use ZenRows with Wget, first sign up to get your free API key.
You'll get to the Request Builder dashboard. There, enter your target URL (https://www.g2.com/products/visual-studio/reviews) and activate the necessary parameters (Premium Proxies and JS Rendering) to build your API URL. Then, select "cURL" (it'll work for Wget).
Copy the API URL and make a Wget request in your command line prompt. The command below will solve all anti-bot challenges and retrieve the HTML content, then save it as index.html.
wget -O index.html "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fvisual-studio%2Freviews&js_render=true&premium_proxy=true"
Run it, and here's our extract from index.html:
<!DOCTYPE html><head><meta charset="utf-8" />
#..
<title id="icon-label-81439bdda0646aea926f63677ad59b2e">G2 - Business Software Reviews</title>
Awesome, you can now bypass any anti-bot measure using Wget.
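If you need API URLs for many target pages, you can assemble them in code instead of copying each one from the dashboard. The sketch below rebuilds the URL used in the command above with Python's standard library. The function name `build_api_url` is an assumption, and the parameter names are taken only from that command.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the Wget command above.
API_ENDPOINT = "https://api.zenrows.com/v1/"


def build_api_url(api_key: str, target_url: str) -> str:
    """Assemble the API URL, percent-encoding the target URL."""
    params = {
        "apikey": api_key,
        "url": target_url,
        "js_render": "true",
        "premium_proxy": "true",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# "YOUR_KEY" is a placeholder, not a real API key.
print(build_api_url("YOUR_KEY", "https://www.g2.com/products/visual-studio/reviews"))
```

The result can be passed to Wget in the same way as the URL copied from the Request Builder.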
Conclusion
Setting and randomizing User Agents enable you to appear as a regular user to your target web page, increasing your chances of avoiding detection. You must ensure these UAs are properly formed for the best results.
However, it's important to note that changing UAs is just one piece of the puzzle. Numerous web scraping challenges, including browser fingerprinting, CAPTCHAs, and various anti-bot techniques, make it difficult to scrape without getting blocked. Fortunately, ZenRows offers an easy solution for Wget.
Try ZenRows for free to save yourself the frustration of dealing with anti-bot measures.