Wget is a free GNU command-line utility for retrieving content via HTTP, HTTPS, and FTP. It's mostly used for mirroring websites, downloading large files, and backing up web content.
However, some websites can flag you as a bot and eventually block your requests, causing your downloads to fail repeatedly. So, what can you do? A reliable solution is to route your requests through a proxy server to avoid bot detection.
In this guide, you'll learn how to use a Wget proxy and the best practices and protocols for web scraping. Let's get to it!
What Is a Wget Proxy?
A Wget proxy is a server that lets users access website content without connecting to the target site directly. It acts as a middleman between the user and the target server, helping to improve privacy and security.
Namely, when you make a request, it's rerouted through the proxy server first. After that, the proxy server sends the request to the website, receives the response, and returns it to you.
Wget Crash Course
Let's cover some fundamentals of Wget before moving on.
Install Wget
Start by installing Wget on your local machine. You can do so on major operating systems like Linux, macOS, and Windows. Although Wget can be downloaded from its official website and installed manually, using a package manager is more convenient.
Install Wget on Linux
Different Linux distributions use different package managers.
If you're using a Debian-based distribution like Ubuntu, install Wget using apt:
sudo apt-get install wget
Here's what you need if you're using other popular package managers:
Use the following for YUM:
sudo yum install wget
As for ZYpp, you can install it like this:
sudo zypper install wget
Now, check to confirm the installation was successful:
wget --version
If all goes well, you'll get feedback showing the Wget version installed on your machine. If that isn't the case, re-run the installation command.
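For reference, the first line of that output looks something like this (the version number and build details will vary by system):
GNU Wget 1.21.3 built on linux-gnu.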
Install Wget on Mac
On macOS, we recommend using the Homebrew package manager:
brew install wget
Install Wget on Windows
For Windows users, an appropriate package manager to use is Chocolatey:
choco install wget
Wget Syntax
The output of Wget's help command, wget -h, reveals its syntax:
wget [OPTION]... [URL]...
[OPTION]... represents the optional flags or parameters that customize Wget's behavior, while [URL]... is the URL of the file to be downloaded.
All available options and flags are listed by the Wget help command above. Here are a few of the most frequently used, with quick examples right after this list:
- -c: resumes a previously paused or interrupted download.
- -O <filename>: defines the downloaded file's name.
- -r: downloads files recursively from the specified URL.
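Here's a quick sketch of those flags in action, using a hypothetical URL as a placeholder:
# Resume an interrupted download
wget -c https://example.com/big-file.iso
# Save the page under a custom filename
wget -O homepage.html https://example.com/
# Download pages recursively
wget -r https://example.com/docs/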
Now that you know what the syntax of Wget looks like, let's proceed.
Download a File with Wget
You can download content from webpages using Wget. Let's say your target website is IdentMe. We can get its content by running this:
wget -qO- https://ident.me/
The -q flag enables quiet mode, and -O- tells wget to write the downloaded content to standard output and print it in the terminal.
After running the command, you should see your IP address as output, which is the content of the IdentMe page:
197.210.7...
Get an Output File via Wget
How about we save the output from IdentMe to a file? Run the previous command without any flags:
wget https://ident.me/
Wget understands the output content type as HTML and automatically saves it to an index.html
file in the same directory where the command was run.
Alternatively, you can specify a directory where the HTML file is to be downloaded by using the -P
flag:
wget -P ./save_here https://ident.me/
Wget will automatically create the specified directory if it doesn't exist. The above command will download the output and save the file in the save_here
directory.
You can also specify a name for the downloaded file instead of the default by using the -O option. The following command downloads the content into a text.txt file in the same directory:
wget -O 'text.txt' https://ident.me/
How to Use Wget with a Proxy from the Command Line
Now, let's see how to use a proxy with Wget. The first step is to get the proxy you want and set the proxy variables for HTTP and HTTPS in the wgetrc file, which holds Wget's configuration.
Let's start by making a request to HTTPBin with Wget to see the IP address of our machine:
wget -qO- https://httpbin.org/ip
You'll get an output similar to this:
{
"origin": "197.210.7..."
}
The value of the origin
key is your IP address. Now, let's use a proxy and see if the output (IP address) will change. If you don't have a proxy available, get one from the free Proxy Server List.
Your proxy should be in this format:
<PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>
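For example, an HTTP proxy listening at 15.229.24.5 on port 10470 (the sample address used later in this guide) would be written as:
http://15.229.24.5:10470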
To use a proxy, Wget checks if the http_proxy, https_proxy, and ftp_proxy variables are set in any of the following places (the last two are sketched briefly after the list):
- The wgetrc default configuration file, located in /usr/local/etc/wgetrc.
- The .wgetrc configuration file, located in $HOME/.wgetrc.
- A configuration file passed via the --config option.
- Proxy environment variables.
- The command itself, using the -e execute flag.
We'll explore option three: creating a configuration file and passing it to Wget.
A Wget configuration file has the syntax below:
variable = value
Create a configuration file (for example, .wgetrc
) in your current directory and add the following to it:
use_proxy = on
http_proxy = http://15.229.24.5:10470
https_proxy = http://15.229.24.5:10470
ftp_proxy = http://15.229.24.5:10470
Replace the sample proxy with a fresh one and save the file. Make the request to HTTPBin again, but this time pass the configuration file like this:
wget --config ./.wgetrc -qO- https://httpbin.org/ip
This time, you should see a different IP address: the address of your proxy.
{
"origin": "15.229.24.5"
}
Wget Proxy Authentication Required: Username and Password
It's common practice for some proxy servers to require client authentication before granting access, especially when dealing with premium services. If that's the case, your Wget proxy string will need authentication options to specify a username and password when connecting to the proxy server. That can be done by passing the --proxy-user and --proxy-password options to wget:
wget --config ./.wgetrc --proxy-user <YOUR_USERNAME> --proxy-password <YOUR_PASSWORD> -qO- https://httpbin.org/ip
Alternatively, you can include your authentication credentials in the proxy string:
<PROXY_PROTOCOL>://<YOUR_USERNAME>:<YOUR_PASSWORD>@<PROXY_IP_ADDRESS>:<PROXY_PORT>
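For instance, with placeholder credentials, the .wgetrc entries from earlier would look something like this:
use_proxy = on
http_proxy = http://<YOUR_USERNAME>:<YOUR_PASSWORD>@15.229.24.5:10470
https_proxy = http://<YOUR_USERNAME>:<YOUR_PASSWORD>@15.229.24.5:10470
ftp_proxy = http://<YOUR_USERNAME>:<YOUR_PASSWORD>@15.229.24.5:10470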
Use a Rotating Proxy with Wget
A rotating proxy is a proxy server that constantly changes IP addresses. With each request coming from a different IP, it becomes more difficult for websites to detect and block automated traffic.
Let's set up a rotating proxy using Wget.
Rotate IPs with a Free Solution
A free solution is to create a list of proxies with different IPs and randomly select and use one. Wget doesn't have a standard way of randomly selecting proxies, but you can use a simple shell script.
Create a proxies.txt
file and add all the proxies you intend to use. Each Wget proxy should be on a new line.
http://113.53.231.133:3129
http://15.229.24.5:10470
Create a shell script that uses GNU's shuf
utility to randomly select a proxy from proxies.txt
and set the proxy in Wget using the -e
execute option:
for i in {1..3}
do
proxy=$(shuf -n 1 proxies.txt) # Pick a random proxy from proxies.txt
wget --config ./.wgetrc -qO- -e use_proxy=yes -e http_proxy="$proxy" -e https_proxy="$proxy" -e ftp_proxy="$proxy" --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> https://httpbin.org/ip
done
Here, we've added a loop to execute the Wget command three times to test whether the IP changes. Paste the script into your terminal and run it. You'll see that the IP changes randomly with each request:
{
"origin": "113.53.231.133"
}
{
"origin": "113.53.231.133"
}
{
"origin": "15.229.24.5"
}
This free implementation aims to show you the basics of IP rotation, but it's unreliable. You'll likely get blocked because you won't have a large enough pool of active proxies, especially if they aren't residential ones. To prove this, let's use this method to scrape data from G2's homepage:
for i in {1..3}
do
proxy=$(shuf -n 1 proxies.txt) # Pick a random proxy from proxies.txt
wget --config ./.wgetrc -qO- -e use_proxy=yes -e http_proxy="$proxy" -e https_proxy="$proxy" -e ftp_proxy="$proxy" --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> https://g2.com
done
Running this script yields an error response or no output at all because G2 blocks all the requests made to it.
Let's see a better alternative in the next section.
Premium Proxy to Avoid Getting Blocked
A premium proxy service is the best way to avoid being blocked while ensuring greater stability and faster connection speeds. ZenRows is an excellent option because it offers a residential proxy service with advanced tools to bypass anti-bot measures. With ZenRows, you get premium proxy rotation, geotargeting, a scraper API, and more. It also offers unified pricing and charges only for successful requests.
Let's see how to use ZenRows' Wget proxy to scrape the G2 homepage that blocked us before. Get your ZenRows API key and up to 1,000 free URLs by signing up for a new account. You'll get access to the easy-to-navigate dashboard shown in the image below:
Update your .wgetrc
config file with the following:
use_proxy = on
check_certificate = off
http_proxy = http://<YOUR_ZENROWS_API_KEY>:js_render=true&premium_proxy=true@api.zenrows.com:8001
https_proxy = http://<YOUR_ZENROWS_API_KEY>:js_render=true&premium_proxy=true@api.zenrows.com:8001
The check_certificate
option is turned off here to prevent SSL certificate errors. Copy the API key from your dashboard and replace <YOUR_ZENROWS_API_KEY>
in the configuration with it.
Now, make a new request to G2 with this command:
wget --config ./.wgetrc https://g2.com
You should get a successful response this time.
Awesome! By default, Wget saves the response in an index.html file, since the response content type is text/html. You can open the file to view your scraped data.
It's worth noting that aside from rotating proxies, ZenRows also rotates User Agents, making the scraping process more effective.
Best Practices for Wget Scraping
Let's see the best practices you can employ to avoid getting blocked while web scraping with Wget.
Set a Real User Agent for Wget Scraping
Your browser sends a string of data known as the User Agent (UA) to the target website's server, which contains information about what browser and operating system you're using. Anti-bot technologies typically examine the UA to differentiate actual browsers from bots.
By customizing the default Wget User Agent with one that looks real, you can reduce the chance of getting blocked. Make a request to the HTTPBin headers endpoint to see what Wget's default UA looks like:
wget -qO- https://httpbin.org/headers
You should get a response similar to this:
{
"headers": {
"Accept": "*/*",
"Accept-Encoding": "identity",
"Host": "httpbin.org",
"User-Agent": "Wget/1.21.3",
"X-Amzn-Trace-Id": "Root=1-64685ea4-60786d911f48365f63075608"
}
}
The default value is Wget/1.21.3, which isn't a valid browser UA. Therefore, websites with anti-bot measures will easily flag you as a bot.
Here's what a valid browser (Chrome) User Agent looks like:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
You can grab some from our list of top User Agents for web scraping. Now, you can set the Wget User Agent by specifying the user_agent option in your configuration file or by using the --user-agent=agent-string flag in your request.
Update your .wgetrc
config file like this:
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
Update your request to include the configuration and run it:
wget --config ./.wgetrc -qO- https://httpbin.org/headers
You should get a similar response as before, but this time with the UA you set:
{
"headers": {
"Accept": "*/*",
"Accept-Encoding": "identity",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-64686fe3-14aa555967dcfdf972c3e336"
}
}
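If you'd rather not edit the configuration file, you can pass the same User Agent directly on the command line, roughly like this:
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" -qO- https://httpbin.org/headers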
Rate Limit with Wget to Avoid Being Blocked
Slowing down your request rate is recommended to avoid getting blocked or overloading the server. Wget provides two options, wait (or waitretry) and limit_rate, that can be added to your configuration file to set a delay between requests and limit the download speed, respectively.
Alternatively, you can use the command-line equivalents: --wait (or --waitretry) and --limit-rate.
For example, --wait=3
sets a three-second delay, while --limit-rate=60k
restricts the download speed to 60 KB/s. While there's no universal speed limit or request delay that ensures safety, you can adhere to some general guidelines and proven tips to avoid being blocked when scraping.
Rate limiting isn't a problem when using a web scraping API like ZenRows because of its large proxy pool.
How to Fix Common Errors
When web scraping with Wget, you may encounter several errors:
Error 407: Wget Proxy Authentication Required
Wget's Error 407 means your proxy server requires authentication, so providing valid credentials will fix the issue. The basic authentication credentials include a username and password. You can specify them directly in your command using the --proxy-user
and --proxy-password
options:
wget --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> <TARGET_URL>
Alternatively, you can use the proxy_user and proxy_password options in your configuration file to run Wget behind a proxy. Replace <YOUR_USERNAME> and <YOUR_PASSWORD> with your proxy credentials and <TARGET_URL> with the URL you want to download. If the credentials are correct, you won't get this error again.
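As a sketch, the equivalent configuration file entries look like this, with placeholder credentials:
proxy_user = <YOUR_USERNAME>
proxy_password = <YOUR_PASSWORD>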
Error 400: Wget Proxy Bad Request
Wget's Error 400 usually means the request you sent to your proxy server wasn't correct. You can fix that by verifying your Wget proxy server's settings, like its address, port, and any other configurations available. The error may also appear if there are problems with the target server. To confirm, try accessing it directly without a proxy.
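For example, one quick way to test the target without the proxy is Wget's --no-proxy flag:
wget --no-proxy -qO- <TARGET_URL>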
Conclusion
When scraping the web, a Wget proxy can help you bypass IP blocks and access content restricted in your country. You now know:
- The basics of Wget.
- How to use a proxy with Wget and authenticate it.
- How to use free and premium proxy rotation solutions.
- Best practices for web scraping with Wget.
Considering the unreliability of free proxies, using premium ones is the best course of action. ZenRows provides an effective rotating proxy service and features to avoid bot detection. Sign up and get 1,000 free credits to try it for yourself.