Wget is a free GNU command-line utility for retrieving content via HTTP, HTTPS, and FTP. It's mostly used for mirroring websites, downloading large files, and backing up web content.
However, some websites can flag you as a bot and eventually block your requests, causing your downloads to fail repeatedly. So, what can you do? A reliable solution is to route your requests through a proxy server to avoid bot detection.
In this guide, you'll learn how to use a Wget proxy and the best practices and protocols for web scraping. Let's get to it!
What Is a Wget Proxy?
A Wget proxy is a server that lets users access website content without connecting to the target site directly. It acts as a middleman between the user and the target server, helping to improve privacy and security.
Namely, when you make a request, it's rerouted through the proxy server first. After that, the proxy server sends the request to the website, receives the response, and returns it to you.
Wget Crash Course
Let's cover some fundamentals of Wget before moving on.
Install Wget
Start by installing Wget on your local machine. You can do so on major operating systems like Linux, macOS, and Windows. Although Wget can be downloaded from its official website and installed manually, using a package manager is more convenient.
Install Wget on Linux
Different Linux distributions use different package managers.
If you're using a Debian-based distribution like Ubuntu, install Wget using apt:
sudo apt-get install wget
Here's what you need if you're using other popular package managers:
Use the following for YUM:
sudo yum install wget
As for ZYpp, you can install it like this:
sudo zypper install wget
Now, check to confirm the installation was successful:
wget --version
If all goes well, you'll get feedback showing the Wget version installed on your machine. If that isn't the case, re-run the installation command.
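For reference, the first line of that output looks something like this (the version number and build details will vary by system):
GNU Wget 1.21.3 built on linux-gnu.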
Install Wget on Mac
On macOS, we recommend using the Homebrew package manager:
brew install wget
Install Wget on Windows
For Windows users, an appropriate package manager to use is Chocolatey:
choco install wget
Wget Syntax
The output of Wget's help command, wget -h, reveals its syntax:
wget [OPTION]... [URL]...
[OPTION]... represents the optional flags or parameters that customize Wget's behavior, while [URL]... is the URL of the file to be downloaded.
All available options and flags are listed by the Wget help command above. Here are a few of the most frequently used, with quick examples right after this list:
- -c: resumes a previously paused or interrupted download.
- -O <filename>: defines the downloaded file's name.
- -r: downloads files recursively from the specified URL.
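Here's a quick sketch of those flags in action, using a hypothetical URL as a placeholder:
# Resume an interrupted download
wget -c https://example.com/big-file.iso
# Save the page under a custom filename
wget -O homepage.html https://example.com/
# Download pages recursively
wget -r https://example.com/docs/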
Now that you know what the syntax of Wget looks like, let's proceed.
Download a File with Wget
You can download content from webpages using Wget. Let's say your target website is IdentMe. We can get its content by running this:
wget -qO- https://ident.me/
The -q flag enables quiet mode, and -O- tells wget to write the downloaded content to standard output and print it in the terminal.
After running the command, you should see your IP address as output, which is the content of the IdentMe page:
197.210.7...
Get an Output File via Wget
How about we save the output from IdentMe to a file? Run the previous command without any flags:
wget https://ident.me/
Wget understands the output content type as HTML and automatically saves it to an index.html
file in the same directory where the command was run.
Alternatively, you can specify a directory where the HTML file is to be downloaded by using the -P
flag:
wget -P ./save_here https://ident.me/
Wget will automatically create the specified directory if it doesn't exist. The above command will download the output and save the file in the save_here
directory.
You can also specify a name for the downloaded file instead of the default by using the -O option. The following command downloads the content into a text.txt file in the same directory:
wget -O 'text.txt' https://ident.me/
How to Use Wget with a Proxy from the Command Line
Now, let's see how to use a proxy with Wget. The first step is to get the proxy you want and set the proxy variables for HTTP and HTTPS in the wgetrc file, which holds Wget's configuration.
Let's start by making a request to HTTPBin with Wget to see the IP address of our machine:
wget -qO- https://httpbin.org/ip
You'll get an output similar to this:
{
"origin": "197.210.7..."
}
The value of the origin
key is your IP address. Now, let's use a proxy and see if the output (IP address) will change. If you don't have a proxy available, get one from the free Proxy Server List.
Your proxy should be in this format:
<PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>
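For example, an HTTP proxy listening at 15.229.24.5 on port 10470 (the sample address used later in this guide) would be written as:
http://15.229.24.5:10470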
To use a proxy, Wget checks if the http_proxy, https_proxy, and ftp_proxy variables are set in any of the following places (the last two are sketched briefly after the list):
- The wgetrc default configuration file, located in /usr/local/etc/wgetrc.
- The .wgetrc configuration file, located in $HOME/.wgetrc.
- A configuration file passed via the --config option.
- Proxy environment variables.
- The command itself, using the -e execute flag.
We'll explore option three: creating a configuration file and passing it to Wget.
A Wget configuration file has the syntax below:
variable = value
Create a configuration file (for example, .wgetrc
) in your current directory and add the following to it:
use_proxy = on
http_proxy = http://15.229.24.5:10470
https_proxy = http://15.229.24.5:10470
ftp_proxy = http://15.229.24.5:10470
Replace the sample proxy with a fresh one and save the file. Make the request to HTTPBin again, but this time pass the configuration file like this:
wget --config ./.wgetrc -qO- https://httpbin.org/ip
This time, you should see a different IP address: the address of your proxy.
{
"origin": "15.229.24.5"
}
Wget Proxy Authentication Required: Username and Password
It's common practice for some proxy servers to require client authentication before granting access, especially when dealing with premium services. If that's the case, your Wget proxy string will need authentication options to specify a username and password when connecting to the proxy server. That can be done by passing the --proxy-user and --proxy-password options to wget:
wget --config ./.wgetrc --proxy-user <YOUR_USERNAME> --proxy-password <YOUR_PASSWORD> -qO- https://httpbin.org/ip
Alternatively, you can include your authentication credentials in the proxy string:
<PROXY_PROTOCOL>://<YOUR_USERNAME>:<YOUR_PASSWORD>@<PROXY_IP_ADDRESS>:<PROXY_PORT>
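For instance, with placeholder credentials, the .wgetrc entries from earlier would look something like this:
use_proxy = on
http_proxy = http://<YOUR_USERNAME>:<YOUR_PASSWORD>@15.229.24.5:10470
https_proxy = http://<YOUR_USERNAME>:<YOUR_PASSWORD>@15.229.24.5:10470
ftp_proxy = http://<YOUR_USERNAME>:<YOUR_PASSWORD>@15.229.24.5:10470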
Use a Rotating Proxy with Wget
A rotating proxy is a proxy server that constantly changes IP addresses. With each request coming from a different IP, it becomes more difficult for websites to detect and block automated traffic.
Let's set up a rotating proxy using Wget.
Rotate IPs with a Free Solution
A free solution is to create a list of proxies with different IPs and randomly select and use one. Wget doesn't have a standard way of randomly selecting proxies, but you can use a simple shell script.
Create a proxies.txt
file and add all the proxies you intend to use. Each Wget proxy should be on a new line.
http://113.53.231.133:3129
http://15.229.24.5:10470
Create a shell script that uses GNU's shuf
utility to randomly select a proxy from proxies.txt
and set the proxy in Wget using the -e
execute option:
for i in {1..3}
do
proxy=$(shuf -n 1 proxies.txt) # Pick a random proxy from proxies.txt
wget --config ./.wgetrc -qO- -e use_proxy=yes -e http_proxy="$proxy" -e https_proxy="$proxy" -e ftp_proxy="$proxy" --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> https://httpbin.org/ip
done
Here, we've added a loop to execute the Wget command three times to test whether the IP changes. Paste the script into your terminal and run it. You'll see that the IP changes randomly with each request:
{
"origin": "113.53.231.133"
}
{
"origin": "113.53.231.133"
}
{
"origin": "15.229.24.5"
}
This free implementation aims to show you the basics of IP rotation, but it's unreliable. You'll likely get blocked because you won't have a large enough pool of active proxies, especially if they aren't residential ones. To prove this, let's use this method to scrape data from G2's homepage:
for i in {1..3}
do
proxy=$(shuf -n 1 proxies.txt) # Pick a random proxy from proxies.txt
wget --config ./.wgetrc -qO- -e use_proxy=yes -e http_proxy="$proxy" -e https_proxy="$proxy" -e ftp_proxy="$proxy" --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> https://g2.com
done
Running this script yields an error response or no output at all because G2 blocks all the requests made to it.
Let's see a better alternative in the next section.
Premium Proxy to Avoid Getting Blocked
A premium proxy service is the best way to avoid being blocked while ensuring greater stability and faster connection speeds. ZenRows is an excellent option because it offers a residential proxy service with advanced tools to bypass anti-bot measures. With ZenRows, you get premium proxy rotation, geotargeting, a scraper API, and more. It also offers unified pricing and charges only for successful requests.
Let's see how to use ZenRows' Wget proxy to scrape the G2 homepage that blocked us before. Get your ZenRows API key and up to 1,000 free URLs by signing up for a new account. You'll get access to the easy-to-navigate dashboard shown in the image below:
Update your .wgetrc
config file with the following:
use_proxy = on
check_certificate = off
http_proxy = http://<YOUR_ZENROWS_API_KEY>:js_render=true&premium_proxy=true@api.zenrows.com:8001
https_proxy = http://<YOUR_ZENROWS_API_KEY>:js_render=true&premium_proxy=true@api.zenrows.com:8001
The check_certificate
option is turned off here to prevent SSL certificate errors. Copy the API key from your dashboard and replace <YOUR_ZENROWS_API_KEY>
in the configuration with it.
Now, make a new request to G2 with this command:
wget --config ./.wgetrc https://g2.com
You should get a successful response this time.
Awesome! By default, Wget saves the response in an index.html file, since the response content type is text/html. You can open the file to view your scraped data.
It's worth noting that aside from rotating proxies, ZenRows also rotates User Agents, making the scraping process more effective.
Best Practices for Wget Scraping
Let's see the best practices you can employ to avoid getting blocked while web scraping with Wget.
Set a Real User Agent for Wget Scraping
Your browser sends a string of data known as the User Agent (UA) to the target website's server, which contains information about what browser and operating system you're using. Anti-bot technologies typically examine the UA to differentiate actual browsers from bots.
By customizing the default Wget User Agent with one that looks real, you can reduce the chance of getting blocked. Make a request to the HTTPBin headers endpoint to see what Wget's default UA looks like:
wget -qO- https://httpbin.org/headers
You should get a response similar to this:
{
"headers": {
"Accept": "*/*",
"Accept-Encoding": "identity",
"Host": "httpbin.org",
"User-Agent": "Wget/1.21.3",
"X-Amzn-Trace-Id": "Root=1-64685ea4-60786d911f48365f63075608"
}
}
The default value is Wget/1.21.3, which isn't a valid browser UA. Therefore, websites with anti-bot measures will easily flag you as a bot.
Here's what a valid browser (Chrome) User Agent looks like:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
You can grab some from our list of top User Agents for web scraping. Now, you can set the Wget User Agent by specifying the user_agent option in your configuration file or by using the --user-agent=agent-string flag in your request.
Update your .wgetrc
config file like this:
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
Update your request to include the configuration and run it:
wget --config ./.wgetrc -qO- https://httpbin.org/headers
You should get a similar response as before, but this time with the UA you set:
{
"headers": {
"Accept": "*/*",
"Accept-Encoding": "identity",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-64686fe3-14aa555967dcfdf972c3e336"
}
}
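If you'd rather not edit the configuration file, you can pass the same User Agent directly on the command line, roughly like this:
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" -qO- https://httpbin.org/headers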
Rate Limit with Wget to Avoid Being Blocked
Slowing down your request rate is recommended to avoid getting blocked or overloading the server. Wget provides two options, wait (or waitretry) and limit_rate, that can be added to your configuration file to set a delay between requests and limit the download speed, respectively.
Alternatively, you can use the command-line equivalents: --wait (or --waitretry) and --limit-rate.
For example, --wait=3
sets a three-second delay, while --limit-rate=60k
restricts the download speed to 60 KB/s. While there's no universal speed limit or request delay that ensures safety, you can adhere to some general guidelines and proven tips to avoid being blocked when scraping.
Rate limiting isn't a problem when using a web scraping API like ZenRows because of its large proxy pool.
How to Fix Common Errors
When web scraping with Wget, you may encounter several errors:
Error 407: Wget Proxy Authentication Required
Wget's Error 407 means your proxy server requires authentication, so providing valid credentials will fix the issue. The basic authentication credentials include a username and password. You can specify them directly in your command using the --proxy-user
and --proxy-password
options:
wget --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> <TARGET_URL>
Alternatively, you can use the proxy_user and proxy_password options in your configuration file to run Wget behind a proxy. Replace <YOUR_USERNAME> and <YOUR_PASSWORD> with your proxy credentials and <TARGET_URL> with the URL you want to download. If the credentials are correct, you won't get this error again.
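As a sketch, the equivalent configuration file entries look like this, with placeholder credentials:
proxy_user = <YOUR_USERNAME>
proxy_password = <YOUR_PASSWORD>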
Error 400: Wget Proxy Bad Request
Wget's Error 400 usually means the request you sent to your proxy server wasn't correct. You can fix that by verifying your Wget proxy server's settings, like its address, port, and any other configurations available. The error may also appear if there are problems with the target server. To confirm, try accessing it directly without a proxy.
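For example, one quick way to test the target without the proxy is Wget's --no-proxy flag:
wget --no-proxy -qO- <TARGET_URL>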
Conclusion
When scraping the web, a Wget proxy can help you bypass IP blocks and access content restricted in your country. You now know:
- The basics of Wget.
- How to use a proxy with Wget and authenticate it.
- How to use free and premium proxy rotation solutions.
- Best practices for web scraping with Wget.
Considering the unreliability of free proxies, using premium ones is the best course of action. ZenRows provides an effective rotating proxy service and features to avoid bot detection. Sign up and get 1,000 free credits to try it for yourself.