How to Use Wget With a Proxy: Tutorial and Best Practices

Yuvraj Chandra
Yuvraj Chandra
September 27, 2024 · 9 min read

Does your IP address keep getting banned or geo-restricted while using Wget? Wget is mostly for mirroring websites, scraping data, downloading large files, and backing up web content.

However, some websites can flag you as a bot and eventually block your requests, causing your download to fail continually. So, what to do? A reliable solution is to route your request via a proxy server to avoid bot detection.

In this guide, you'll learn the fundamentals of Wget and how to set a Wget proxy while retrieving content from the web. Let's get to it!

What Is Wget?

Wget is an open-source command-line utility for automatically retrieving web content over standard internet protocols, including HTTP, HTTPS, FTP and FTPS. It's non-interactive, and you can call its commands from scripts or directly via the terminal. Wget is commonly used in Unix-based operating systems, including Linux and macOS, but you can also install it on Windows.

Install Wget

You can install Wget on major operating systems like Linux, Mac, and Windows. Although you can download Wget from its official website and install it manually, using a package manager is convenient.

Install Wget on Linux

Package managers differ per Linux distribution. If you're using a Debian-based distribution like Ubuntu, install Wget using apt:

Terminal
sudo apt-get install wget

Here's what you need if you're using other popular package managers:

Use the following for YUM:

Terminal
sudo yum install wget

As for ZYpp, you can install it like this:

Terminal
sudo zypper install wget

Install Wget on Mac

On macOS, we recommend using the Homebrew package manager:

Terminal
brew install wget

Install Wget on Windows

For Windows users, an appropriate package manager to use is Chocolatey. Ensure you run the terminal as an administrator for a smooth installation.

Terminal
choco install wget

Now, check to confirm the installation was successful:

Terminal
wget --version

If all goes well, you'll get feedback showing the Wget version installed on your machine. If that isn't the case, re-run the installation command.

Wget Crash Course

Before moving on, let's wrap our hands around the basics of Wget.

Wget Syntax

The output of Wget's help command, wget -h, reveals its syntax:

Output
wget [OPTION]... [URL]...

[OPTIONS] are the various optional flags or parameters to customize Wget's behavior, while [URL] is the URL of the file you want to download.

Here are a few of the most frequently used commands:

  • -c: resumes a previous paused or interrupted download.
  • -O <filename>: defines the downloaded file's name.
  • -r: downloads files repeatedly from the specified URL.
  • -qO-: suppresses most outputs and displays output directly on the command line.

Now that you know what Wget's syntax looks like, let's proceed.

Download a File With Wget

You can download content from web pages using Wget. Let's say your target website is the Ecommerce demo page. We can get its content by running this command:

Terminal
wget -qO- https://www.scrapingcourse.com/ecommerce/

The q flag is for quiet mode, and O- tells Wget to generate standard output and print to the terminal.

After running the command, you should get the website's full-page HTML:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!--- ... --->
  
    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
    
  <!--- ... --->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4">

        <!--- ... --->

    </ul>
</body>
</html>

Bravo! You just extracted your first web content using Wget.

Get an Output File via Wget

How about we save the previous output to a file? Run the previous command without any flags:

Terminal
wget https://www.scrapingcourse.com/ecommerce/

Wget understands the output content type as HTML and automatically saves it to an index.html file in the same directory where the command runs.

Alternatively, you can specify a directory to download the HTML file using the -P flag. Wget will automatically create the specified directory if it doesn't exist. The command below saves the output file in the output_folder directory.

Terminal
wget -P ./output_folder https://www.scrapingcourse.com/ecommerce/

Using the -O option, you can specify a name for the downloaded file instead of the default. The following command downloads the page content into an ecommerce.html file in the same directory.

Terminal
wget -O ecommerce.html https://www.scrapingcourse.com/ecommerce/

You can also save the named output file into a specific folder. The following command stores ecommerce.html into the output_folder directory:

Output
wget -O ./output_folder/ecommerce.html  https://www.scrapingcourse.com/ecommerce/

Download Multiple Files

Wget also lets you download multiple content consecutively from different URLs. 

Say you want to download the first two product URLs on the Ecommerce demo page. You can specify them with the wget command. The below command downloads the content of each URL into separate HTML files.

Terminal
wget https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/ https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/

What if you want to specify a download path for each URL? You can achieve it with the -O flag for each URL. However, you can't use multiple -O options in a single line.

To apply the O- flag for each URL without errors, separate each command with && like so:

Terminal
wget -O output1.html https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/ && wget -O output2.html https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/

The above command saves each URL's content in output1.html and output2.html, respectively.

For a cleaner command, store the links in a txt file in your project directory and download them iteratively. Consider having the following txt file, for example:

urls.txt
https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
https://www.scrapingcourse.com/ecommerce/product/aeon-capri/

The following command downloads the links in urls.txt into separate files.

Terminal
wget -i urls.txt

Change User Agent

Your browser sends a data string known as the User Agent (UA) to the target website's server. This UA string contains information about your browser and operating system. Anti-bot technologies typically examine the UA to differentiate actual browsers from bots. 

By customizing the default Wget User Agent with one that looks real, you can reduce the chance of getting blocked. Request the HTTPBin User Agent endpoint to see what Wget's default User Agent looks like:

Terminal
wget -qO- https://httpbin.io/user-agent

You should get the following response:

Output
{
  "user-agent": "Wget/1.21.4"
}

The default value is Wget/1.21.4, which isn't a valid browser User Agent. Websites with anti-bot measures will easily flag you as a bot.

Here's what a valid browser (Chrome) User Agent looks like:

Example
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36

You can set the Wget User Agent by specifying the user_agent option in your .wgetrc configuration file or using the --user-agent=user-agent-string flag in your request.

Grab a User Agent string from our list of top User Agents for web scraping.

To use the --user-agent flag directly via the terminal:

Terminal
wget -qO- --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" https://httpbin.io/user-agent

The above command outputs the following:

Output
{
  "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}

You should prefer setting the User Agent inside the configuration file for convenience. Create a .wgetrc file in your project root directory and add the User Agent string to the file:

.wgetrc
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"

Run your request with the configuration file:

Terminal
wget --config ./.wgetrc -qO- https://httpbin.io/user-agent

You'll get this output, indicating you've changed Wget's User Agent:

Output
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

Wget also lets you extract links from a website. This feature is handy for crawling pages recursively during web scraping. 

The following command extracts all relevant links from the Ecommerce demo page and stores them in a links.txt file.

Terminal
wget --spider --recursive --output-file=links.txt https://www.scrapingcourse.com/ecommerce

The above extracts all links on the website without applying filters. Say you want to extract only the product links from this website. You'll apply a regex flag pointing to the website's product route before specifying the target URL:

Terminal
wget --recursive --no-verbose --output-file=links.txt --accept-regex="https://www.scrapingcourse.com/ecommerce/product/" https://www.scrapingcourse.com/ecommerce/

The above command will write the product URLs, including the links relevant to that route to the links.txt file:

links.txt
2024-09-10 16:12:38 URL:https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/ [120241] ->
"www.scrapingcourse.com/ecommerce/product/abominable-hoodie/index.html" [1]
2024-09-10 16:12:39 URL:https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/ [119210] ->
"www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/index.html" [1]
// ...

Rate Limit

Slowing down your request rate while executing many requests to a single domain is recommended. Rate limiting is essential to avoid getting blocked or overloading the server.

Wget provides two options. These include wait or waitretry to set a delay between requests and limit_rate to limit your download rate. You can add these to your configuration file to apply these settings. 

For example, add the following to your .wgetrc file to wait 10 seconds between retries and limit the download speed to 200KB/s:

.wgetrc
waitretry = 10
limit-rate = 200k

Run the wget command with the above configuration:

Terminal
Wget --config ./.wgetrc -qO- -i urls.txt

The above command requests the URLs inside urls.txt while applying the rate limit settings. 

Alternatively, you can set the --waitretry and --limit-rate flags directly via the terminal request:

Terminal
wget --waitretry=10 --limit-rate=200k -qO- -i urls.txt

While there's no universal speed limit or request delay that ensures safety, you can adhere to some general guidelines and proven tips to avoid getting blocked when scraping.

Now that you know the fundamentals, let's show you how to set a proxy with Wget.

How to Use Wget With a Proxy From the Command Line

The first step to setting a Wget proxy is to get the proxy you want and set the proxy variables for HTTP and HTTPS in the .wgetrc file that holds the configuration content for Wget.

We'll start by requesting the HTTPBin IP endpoint with Wget to see our machine's IP address:

Terminal
wget -qO- https://httpbin.io/ip

You'll get an output similar to this:

Output
{
  "origin": "102.80.77.54:43854"
}

The value of the origin key is your IP address. Now, let's use a proxy and see if the output (IP address) will change. If you don't yet have a proxy, get one from the Free Proxy List.

To use a proxy, Wget checks if the http_proxy, https_proxy, and ftp_proxy variables are set in:

  1. The wgetrc default configuration file.
  2. A custom configuration file passed to it via the --config file.
  3. The proxy environment variable.
  4. The terminal command using the -e flag.

We'll explore option two: creating a configuration file and passing it to Wget.

A Wget configuration file has the syntax below:

Example
Example
variable = value

Create a configuration file (for example, .wgetrc) in your current directory and add the following to it:

.wgetrc
use_proxy = on
http_proxy = http://15.229.24.5:10470
https_proxy = http://15.229.24.5:10470
ftp_proxy = http://15.229.24.5:10470

Remember that the above is a free proxy, which is unreliable due to its short lifespan. So, update the proxy with a fresh one and save it. 

Request HTTPBin, but this time, pass the configuration file like this:

Terminal
wget --config ./.wgetrc -qO- https://httpbin.io/ip

This time, you should get your proxy's IP address.

Output
{
  "origin": "15.229.24.5"
}

Wget Proxy Authentication Required: Username and Password

Some proxy servers, such as premium services, require client authentication before granting access. If so, your Wget proxy string will require authentication options to specify a username and password when connecting to the proxy server. 

You can achieve this by passing the --proxy-user and --proxy-password options to Wget, including the HTTP and HTTPS proxy protocols for the proxy URL.

Terminal
wget --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> -e use_proxy=yes -e http_proxy=http://<PROXY_ADDRESS>:<PROXY_PORT> -e https_proxy=https://<PROXY_ADDRESS>:<PROXY_PORT> -qO- https://httpbin.io/ip

Alternatively, you can include your authentication credentials in your .wgetrc configuration file like so:

.wgetrc
use_proxy = on
http_proxy = http://<PROXY_ADDRESS>:<PROXY_PORT>
https_proxy = https://<PROXY_ADDRESS>:<PROXY_PORT>
proxy_user = <YOUR_USERNAME>
proxy_password = <YOUR_PASSWORD>

Then, execute your request with the --config flag to use the .wegtrc proxy settings:

Terminal
wget --config ./.wgetrc -qO- https://httpbin.io/ip

The above command will return your authenticated proxy IP address.

We've set the proxies for HTTP and HTTPS protocols. Let's quickly explain the most common ones below.

Best Protocols for a Wget Proxy

The most common proxy types are HTTP, HTTPS, and SOCKS5. HTTP proxies route standard, unencrypted web traffic, while HTTPS proxies provide encrypted communication for secure data transfer. 

SOCKS5 is a more versatile proxy that handles various types of traffic, including non-HTTP protocols, making it suitable for tasks like file sharing or email traffic.

Overall, HTTP and HTTPS proxies are suitable for web scraping and crawling, while SOCKS finds applications in tasks involving non-HTTP traffic. Wget has built-in support for HTTP and HTTPS proxies but not SOCKS. Consider using cURL if you want to use a SOCKS proxy. To learn more, check out our article on web scraping with cURL.

Use a Rotating Proxy With Wget

A rotating proxy is a proxy server that constantly changes IP addresses. With each request coming from a different IP, it becomes more difficult for websites to detect and block automated traffic.

Let's set up a rotating proxy using Wget.

Rotate IPs With a Free Solution

A free solution is to create a list of proxies with different IPs and randomly select one per request. Wget doesn't have a standard way of randomly selecting proxies, but you can use a simple shell script.

Grab a few free proxies from the Free Proxies List. Then, create a proxies.txt file and add all the proxies you intend to use. Each Wget proxy should be on a new line.

proxies.txt
http://113.53.231.133:3129
http://15.229.24.5:10470

Create a random_proxies.sh shell script that uses GNU's shuf utility to select a proxy from proxies.txt randomly. Then, set the proxy in Wget using the -e execute option. We've added a loop to execute the Wget command three times to test if the IP changes:

random_proxies.sh
#!/bin/bash
# loop 3 times to make 3 requests
for i in {1..3}
do
    # pick a random proxy from proxies.txt
    proxy=$(shuf -n 1 proxies.txt)

    # output the selected proxy
    echo "Using proxy: $proxy"

    # run wget using the random proxy
    wget --config ./.wgetrc -e use_proxy=yes -e http_proxy="$proxy" -e https_proxy="$proxy" -e ftp_proxy="$proxy" -qO- https://httpbin.io/ip
done

Run the shell script via your terminal:

Terminal
./random_proxies.sh

You'll discover that the IP randomly changes for every request:

Output
{
  "origin": "113.53.231.133"
}
{
  "origin": "113.53.231.133"
}
{
  "origin": "15.229.24.5"
}

You're now rotating free proxies using a bash script with Wget. 

However, this solution is unreliable because free proxies are shared among many users and are short-lived, making them prone to bans. Additionally, the manually created proxy list becomes difficult to maintain at scale, especially when some proxies in the pool start failing. 

The best approach is to use residential auto-rotating proxies. We'll explain how to get them in the next section.

Premium Proxy to Avoid Getting Blocked

Premium residential proxies use network provider IPs belonging to daily internet users. Most providers rotate these residential proxies from a pool, ensuring you appear as a different user per request. This feature makes you less detectable, especially during large-scale scraping.

ZenRows is a top residential proxy provider offering quality proxy rotation and geo-targeting with flexible pricing. It provides 55 million globally distributed residential IPs covering 185+ countries.

To use ZenRows, sign up to open the Request Builder.

Go to the Proxy Generator and copy the proxy domain information and your proxy credentials (username and password). Paste this information into your .wgetrc file.

generate residential proxies with zenrows
Click to open the image in full screen

Your .wgetrc configuration file should contain your ZenRows proxy details, as shown:

.wgetrc
use_proxy = on
http_proxy = http://superproxy.zenrows.com:1337
https_proxy = http://superproxy.zenrows.com:1338
proxy_user = <ZENROWS_PROXY_USERNAME>
proxy_password = <ZENROWS_PROXY_PASSWORD>

Now, run your Wget request with the above proxy configuration:

Terminal
wget --config ./.wgetrc -qO- https://httpbin.io/ip

See the result for three consecutive requests:

Output
# request 1
{
    "origin": "95.49.7.50:45176",
}
# request 2
{
    "origin": "189.188.232.238:49067"
}
# request 3
{
    "origin": "109.196.227.119:55700"
}

Your Wget scraper now routes requests through ZenRows' residential proxies. 

However, proxies might be insufficient when dealing with advanced anti-bot measures. The good news is that ZenRows also offers a web scraping API to bypass anti-bots at the same price as the proxy service. You can completely replace your Wget scraper with this scraping API and forget about getting blocked. Try the ZenRows scraper API for free.

How to Fix Common Errors

When web scraping with Wget, you may encounter several errors:

Error 407: Wget Proxy Authentication Required

Wget's Error 407 means your proxy server requires authentication. Providing valid credentials will fix this issue. The basic authentication credentials include a username and password. You can specify them directly in your command using the --proxy-user and --proxy-password options:

Terminal
wget --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> -e use_proxy=yes -e http_proxy=http://<PROXY_ADDRESS>:<PROXY_PORT> -e https_proxy=https://<PROXY_ADDRESS>:<PROXY_PORT> -qO- <TARGET_URL>

Alternatively, you can set the proxy_user and proxy_password options in your configuration file to run Wget behind a proxy.

Replace <YOUR_USERNAME>, <YOUR_PASSWORD>, and <TARGET_URL> with your proxy server's username, password, and target URL, respectively. If the credentials are correct, you won't get this error again.

Error 400: Wget Proxy Bad Request

Wget's Error 400 usually means the request you sent to your proxy server wasn't correct. You can fix that by verifying your Wget proxy server's settings, like its address, port, and any other configurations available. The error may also appear if there are problems with the target server or if it's protected by anti-bot systems like Cloudflare, which might respond with errors like Error 1020. To confirm the cause, try accessing it directly without a proxy.

Conclusion

When scraping the web, a Wget proxy can help bypass IP blocks and view content restricted in your country. You now know the basics of Wget, how to use a free or a premium proxy with it, and how to authenticate your proxies.

Considering free proxies' unreliability, premium services are the best option. ZenRows provides an effective residential rotating proxy service and features to avoid bot detection. 

Sign up for free now without a credit card.

Ready to get started?

Up to 1,000 URLs for free are waiting for you