How to Use Wget with a Proxy: Steps & Best Practices

June 1, 2023 · 11 min read

Wget is a free GNU command-line utility for retrieving content via HTTP, HTTPS, and FTP. It's mostly used for mirroring websites, downloading large files, and backing up web content.

However, some websites can flag you as a bot and eventually block your requests, causing your downloads to fail repeatedly. So, what can you do? A reliable solution is to route your requests through a proxy server to avoid bot detection.

In this guide, you'll learn how to use a Wget proxy and the best practices and protocols for web scraping. Let's get to it!

What Is a Wget Proxy?

A Wget proxy is a server that lets users access websites without connecting to them directly. It acts as a middleman between the user and the target server, helping to improve privacy and security.

Namely, when you make a request, it's rerouted through the proxy server first. After that, the proxy server sends the request to the website, receives the response, and returns it to you.

Wget Crash Course

Let's cover some fundamentals of Wget before moving on.

Install Wget

Start by installing Wget on your local machine. It's available on all major operating systems, including Linux, macOS, and Windows. Although Wget can be downloaded from its official website and installed manually, using a package manager is more convenient.

Install Wget on Linux

Different Linux distributions use different package managers.

If you're using a Debian-based distribution like Ubuntu, install using apt:

Terminal
sudo apt-get install wget

Here's what you need if you're using other popular package managers:

Use the following for YUM:

Terminal
sudo yum install wget

As for ZYpp, you can install it like this:

Terminal
sudo zypper install wget

Now, check to confirm the installation was successful:

Terminal
wget --version

If all goes well, you'll get feedback showing the Wget version installed on your machine. If that isn't the case, re-run the installation command.
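
For reference, the first line of the version output looks similar to this (the exact version depends on your installation):

Output
GNU Wget 1.21.3 built on linux-gnu.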

Install Wget on Mac

On macOS, we recommend using the Homebrew package manager:

Terminal
brew install wget

Install Wget on Windows

For Windows users, an appropriate package manager to use is Chocolatey:

Terminal
choco install wget

Wget Syntax

The output of Wget's help command, wget -h, reveals its syntax:

Output
wget [OPTION]... [URL]...

[OPTION]... represents the various optional flags or parameters that can be used to customize the behavior of Wget, while [URL]... is the URL of the file to be downloaded.

All available options or flags are found using the Wget help command above. Here are a few of the most frequently used:

  • -c resumes a previously paused or interrupted download.
  • -O <filename> defines the downloaded file's name.
  • -r downloads files recursively from the specified URL.
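
For instance, the following command combines two of these flags to download a file under a custom name and resume it later if the connection drops (the URL is a placeholder):

Terminal
wget -c -O report.pdf https://example.com/files/report.pdf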

Now that you know what the syntax of Wget looks like, let's proceed.

Download a File with Wget

You can download content from webpages using Wget. Let's say your target website is IdentMe, a service that returns the visitor's IP address. You can get its content by running this:

Terminal
wget -qO- https://ident.me/

The -q flag enables quiet mode, and -O- tells Wget to write the downloaded content to standard output, printing it to the terminal instead of saving a file.

After running the command, you should get your IP address (output), which is the content of the IdentMe page:

Output
197.210.7...

Get an Output File via Wget

How about we save the output from IdentMe to a file? Run the previous command without any flags:

Terminal
wget https://ident.me/

Since the URL path ends with a slash and doesn't name a file, Wget saves the response to a default index.html file in the same directory where the command was run.

Alternatively, you can specify a directory where the HTML file is to be downloaded by using the -P flag:

Terminal
wget -P ./save_here https://ident.me/

Wget will automatically create the specified directory if it doesn't exist. The above command will download the output and save the file in the save_here directory.

You can also give the downloaded file a custom name instead of the default by using the -O option. The following command saves the content to a text.txt file in the same directory:

Terminal
wget -O 'text.txt' https://ident.me/ 

How to Use Wget with a Proxy from the Command Line

Now, let's see how to use a proxy with Wget. The first step is to get a proxy and set the proxy variables for HTTP and HTTPS in the wgetrc file, which holds Wget's configuration.

Let's start by making a request to HTTPBin with Wget to see the IP address of our machine:

Terminal
wget -qO- https://httpbin.org/ip

You'll get an output similar to this:

Output
{
  "origin": "197.210.7..."
}

The value of the origin key is your IP address. Now, let's use a proxy and see if the output (IP address) will change. If you don't have a proxy available, get one from the free Proxy Server List.

To use a proxy, Wget checks whether the http_proxy, https_proxy, and ftp_proxy variables are set in any of the following places:

  1. The default wgetrc configuration file, located at /usr/local/etc/wgetrc.
  2. The user's .wgetrc configuration file, located at $HOME/.wgetrc.
  3. A configuration file passed to it via the --config flag.
  4. Proxy environment variables.
  5. Options executed on the command line with the -e flag.
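
As a quick illustration of the last option, you can set the proxy variables inline for a single command with the -e flag (the proxy IP below is a placeholder):

Terminal
wget -e use_proxy=yes -e http_proxy=http://15.229.24.5:10470 -qO- https://httpbin.org/ip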

We'll explore option three: creating a configuration file and passing it to Wget.

A Wget configuration file has the syntax below:

Example
variable = value

Create a configuration file (for example, .wgetrc) in your current directory and add the following to it:

.wgetrc
use_proxy = on
http_proxy = http://15.229.24.5:10470
https_proxy = http://15.229.24.5:10470
ftp_proxy = http://15.229.24.5:10470

Update the proxy address with a fresh one and save the file. Then make the request to HTTPBin again, but this time pass the configuration file like this:

Terminal
wget --config ./.wgetrc -qO- https://httpbin.org/ip

This time, you should see a different IP address in the output: the address of your proxy.

Output
{
  "origin": "15.229.24.5"
}

Wget Proxy Authentication Required: Username and Password

It's common practice for some proxy servers to require client authentication before granting access, especially when dealing with premium services. If that's the case, your Wget proxy string will need authentication options to specify a username and password when connecting to the proxy server. That can be done by passing the --proxy-user and --proxy-password options to wget:

Terminal
wget --config ./.wgetrc --proxy-user <USERNAME> --proxy-password <PASSWORD> -qO- https://httpbin.org/ip

Alternatively, you can include your authentication credentials in the proxy string:

Terminal
<PROXY_PROTOCOL>://<USERNAME>:<PASSWORD>@<PROXY_IP_ADDRESS>:<PROXY_PORT>
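
For example, with hypothetical credentials and the proxy address used earlier, the authenticated entries in your .wgetrc would look like this:

.wgetrc
http_proxy = http://<USERNAME>:<PASSWORD>@15.229.24.5:10470
https_proxy = http://<USERNAME>:<PASSWORD>@15.229.24.5:10470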

Use a Rotating Proxy with Wget

A rotating proxy is a proxy server that constantly changes IP addresses. With each request coming from a different IP, it becomes more difficult for websites to detect and block automated traffic.

Let's set up a rotating proxy using Wget.

Rotate IPs with a Free Solution

A free solution is to create a list of proxies with different IPs and randomly select and use one. Wget doesn't have a standard way of randomly selecting proxies, but you can use a simple shell script.

Create a proxies.txt file and add all the proxies you intend to use. Each Wget proxy should be on a new line.

proxies.txt
http://113.53.231.133:3129
http://15.229.24.5:10470

Create a shell script that uses GNU's shuf utility to randomly select a proxy from proxies.txt and set the proxy in Wget using the -e execute option:

Terminal
for i in {1..3}
do
    proxy=$(shuf -n 1 proxies.txt) # Pick random proxy from proxies
    wget --config ./.wgetrc -qO- -e use_proxy=yes -e http_proxy=$proxy -e https_proxy=$proxy -e ftp_proxy=$proxy --proxy-user=your_username --proxy-password=PASSWORD https://httpbin.org/ip
done

Here, we've added a loop to execute the Wget command three times to test if the IP changes. Paste the script in your terminal and run it. You'll discover that the IP randomly changes for every request:

Output
{
  "origin": "113.53.231.133"
}
{
  "origin": "113.53.231.133"
}
{
  "origin": "15.229.24.5"
}

This free implementation aims to show you the basics of IP rotation, but it's unreliable. You'll likely get blocked because free proxy lists rarely contain enough active proxies, especially if they aren't residential ones. To prove this, let's use this method to scrape data from G2's homepage:

Terminal
for i in {1..3}
do
    proxy=$(shuf -n 1 proxies.txt) # Pick random proxy from proxies
    wget --config ./.wgetrc -qO- -e use_proxy=yes -e http_proxy=$proxy -e https_proxy=$proxy -e ftp_proxy=$proxy --proxy-user=your_username --proxy-password=PASSWORD https://g2.com
done

Running this script will yield an error page or no output at all because G2 has blocked all the requests made to it.

Wget G2 Response

Let's see a better alternative in the next section.

Premium Proxy to Avoid Getting Blocked

A premium proxy service is the best way to avoid being blocked while ensuring greater stability and faster connection speeds. ZenRows is an excellent option thanks to its diverse and effective features: geotargeting, premium proxy rotation, and flexible pricing starting at $49/month, where you're charged only for successful requests.

Let's see how to use ZenRows' Wget proxy to scrape G2's homepage, which blocked us before. Sign up for a new account to get your ZenRows API key and 1,000 free credits. You'll get access to the easy-to-navigate dashboard shown in the image below:

ZenRows Dashboard

Update your .wgetrc config file with the following:

.wgetrc
use_proxy = on
check_certificate = off

http_proxy = http://<YOUR_ZENROWS_API_KEY>:antibot=true&[email protected]:8001
https_proxy = http://<YOUR_ZENROWS_API_KEY>:antibot=true&[email protected]:8001

The check_certificate option is turned off here to prevent SSL certificate errors. Copy the API key from your dashboard and replace <YOUR_ZENROWS_API_KEY> in the configuration with it.

Now, make a new request to G2 with this command:

Terminal
wget --config ./.wgetrc https://g2.com

You should get a successful response, such as:

Wget G2

Awesome! By default, Wget saves the response in an index.html file, since the response's content type is text/html. You can open the file to view your scraped data.

It's worth noting that aside from rotating proxies, ZenRows also rotates User Agents, making the scraping process more effective.

Best Practices for Wget Scraping

Let's see the best practices you can employ to avoid getting blocked while web scraping with Wget.

Set a Real User Agent for Wget Scraping

Your browser sends a string of data known as the User Agent (UA) to the target website's server, which contains information about what browser and operating system you're using. Anti-bot technologies typically examine the UA to differentiate actual browsers from bots. 

By customizing the default Wget User Agent with one that looks real, you can reduce the chance of getting blocked. Make a request to the HTTPBin headers endpoint to see what Wget's default UA looks like:

Terminal
wget -qO- https://httpbin.org/headers

You should get a response similar to this:

Output
{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "identity", 
    "Host": "httpbin.org", 
    "User-Agent": "Wget/1.21.3", 
    "X-Amzn-Trace-Id": "Root=1-64685ea4-60786d911f48365f63075608"
  }
}

The default value is Wget/1.21.3, which isn't a valid browser UA. Therefore, websites with anti-bot measures will easily flag you as a bot. 

Here's what a valid browser (Chrome) User Agent looks like:

Example
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36

You can grab some from our list of top User Agents for web scraping. Now, you can set the Wget User Agent by specifying the user_agent option in your configuration file or by using the --user-agent=agent-string flag in your request.

Update your .wgetrc config file like this:

.wgetrc
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"

Update your request to include the configuration and run it:

Terminal
wget --config ./.wgetrc -qO- https://httpbin.org/headers

You should get a similar response as before, but this time with the UA you set:

Output
{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "identity", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-64686fe3-14aa555967dcfdf972c3e336"
  }
}
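
Alternatively, if you'd rather not touch the configuration file, you can pass the same UA directly on the command line with the --user-agent flag:

Terminal
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" -qO- https://httpbin.org/headers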

Rate Limit with Wget to Avoid Being Blocked

Slowing down your request rate is recommended to avoid getting blocked or overloading the server. Wget provides the wait (or waitretry) and limit_rate options, which can be added to your configuration file to set a delay between requests and to limit the download speed, respectively.

Alternatively, you can use the command-line equivalents: --wait (or --waitretry) and --limit-rate.

For example, --wait=3 sets a three-second delay, while --limit-rate=60k restricts the download speed to 60 KB/s. While there's no universal speed limit or request delay that ensures safety, you can adhere to some general guidelines and proven tips to avoid being blocked when scraping.
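
Putting both together, a minimal example that recursively downloads a site while pausing three seconds between requests and capping the speed at 60 KB/s would look like this (the URL is a placeholder):

Terminal
wget --wait=3 --limit-rate=60k -r https://example.com/docs/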

How to Fix Common Errors

When web scraping with Wget, you may encounter several errors:

Error 407: Wget Proxy Authentication Required

Wget's Error 407 means your proxy server requires authentication, so providing valid credentials will fix the issue. The basic authentication credentials include a username and password. You can specify them directly in your command using the --proxy-user and --proxy-password options:

Terminal
wget --proxy-user=<USERNAME> --proxy-password=<PASSWORD> <TARGET_URL>

Alternatively, you can use the proxy_user and proxy_password options in your configuration file to use Wget behind a proxy, as shown below. Replace <USERNAME>, <PASSWORD>, and <TARGET_URL> with your proxy server's username, password, and target URL, respectively. If the credentials are correct, you won't get this error again.
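
For reference, the equivalent configuration file entries look like this:

.wgetrc
proxy_user = <USERNAME>
proxy_password = <PASSWORD>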

Error 400: Wget Proxy Bad Request

Wget's Error 400 usually means the request you sent to your proxy server wasn't correct. You can fix that by verifying your Wget proxy server's settings, like its address, port, and any other configurations available. The error may also appear if there are problems with the target server. To confirm, try accessing it directly without a proxy.
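
To rule out the proxy, Wget's --no-proxy flag forces a direct connection even when proxy variables are set:

Terminal
wget --no-proxy -qO- <TARGET_URL>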

Conclusion

When scraping the web, a Wget proxy can help you bypass IP blocks and view content restricted in your country. You now know:

  • The basics of Wget.
  • How to use a proxy with Wget and authenticate it. 
  • How to use free and premium proxy rotation solutions.
  • Best practices for web scraping with Wget.

Considering the unreliability of free proxies, using premium ones is the best course of action. ZenRows provides an effective rotating proxy service and features to avoid bot detection. Sign up and get 1,000 free credits to try it for yourself.


