Does your IP address keep getting banned or geo-restricted while using Wget? Wget is commonly used for mirroring websites, scraping data, downloading large files, and backing up web content.
However, some websites can flag you as a bot and eventually block your requests, causing your downloads to fail repeatedly. So, what can you do? A reliable solution is to route your requests through a proxy server to avoid bot detection.
In this guide, you'll learn the fundamentals of Wget and how to set a Wget proxy while retrieving content from the web. Let's get to it!
What Is Wget?
Wget is an open-source command-line utility for automatically retrieving web content over standard internet protocols, including HTTP, HTTPS, FTP and FTPS. It's non-interactive, and you can call its commands from scripts or directly via the terminal. Wget is commonly used in Unix-based operating systems, including Linux and macOS, but you can also install it on Windows.
Install Wget
You can install Wget on major operating systems like Linux, Mac, and Windows. Although you can download Wget from its official website and install it manually, using a package manager is convenient.
Install Wget on Linux
Package managers differ per Linux distribution. If you're using a Debian-based distribution like Ubuntu, install Wget using apt:
sudo apt-get install wget
Here's what you need if you're using other popular package managers:
Use the following for YUM:
sudo yum install wget
As for ZYpp, you can install it like this:
sudo zypper install wget
Install Wget on Mac
On macOS, we recommend using the Homebrew package manager:
brew install wget
Install Wget on Windows
For Windows users, an appropriate package manager to use is Chocolatey. Ensure you run the terminal as an administrator for a smooth installation.
choco install wget
Now, check to confirm the installation was successful:
wget --version
If all goes well, you'll get feedback showing the Wget version installed on your machine. If that isn't the case, re-run the installation command.
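For reference, the first line of a successful version check typically looks like this (the exact version and build details depend on your system):
GNU Wget 1.21.4 built on linux-gnu.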
Wget Crash Course
Before moving on, let's get a handle on the basics of Wget.
Wget Syntax
The output of Wget's help command, wget -h, reveals its syntax:
wget [OPTION]... [URL]...
[OPTION]... stands for the optional flags or parameters that customize Wget's behavior, while [URL]... is the URL of the file you want to download.
Here are a few of the most frequently used commands:
- -c: resumes a previously paused or interrupted download.
- -O <filename>: defines the downloaded file's name.
- -r: downloads files recursively from the specified URL.
- -qO-: suppresses most output and prints the downloaded content directly to the terminal.
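For instance, here's a hypothetical command that combines two of these flags to resume an interrupted download and give the output file a specific name (the URL and filename are placeholders):
wget -c -O ubuntu.iso https://example.com/downloads/ubuntu.iso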
Now that you know what Wget's syntax looks like, let's proceed.
Download a File With Wget
You can download content from web pages using Wget. Let's say your target website is the Ecommerce demo page. We can get its content by running this command:
wget -qO- https://www.scrapingcourse.com/ecommerce/
The -q flag enables quiet mode, while -O- tells Wget to write the downloaded content to standard output, printing it to the terminal instead of saving it to a file.
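In long-option form, the equivalent command would be the following (-q maps to --quiet, and -O to --output-document, with - meaning standard output):
wget --quiet --output-document=- https://www.scrapingcourse.com/ecommerce/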
After running the command, you should get the website's full-page HTML:
<!DOCTYPE html>
<html lang="en-US">
<head>
<!--- ... --->
<title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
<!--- ... --->
</head>
<body class="home archive ...">
<p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
<ul class="products columns-4">
<!--- ... --->
</ul>
</body>
</html>
Bravo! You just extracted your first web content using Wget.
Get an Output File via Wget
How about we save the previous output to a file? Run the previous command without any flags:
wget https://www.scrapingcourse.com/ecommerce/
Wget recognizes the output as HTML and automatically saves it to an index.html file in the directory where the command runs.
Alternatively, you can specify a download directory for the HTML file using the -P flag. Wget will automatically create the specified directory if it doesn't exist. The command below saves the output file in the output_folder directory:
wget -P ./output_folder https://www.scrapingcourse.com/ecommerce/
Using the -O option, you can specify a name for the downloaded file instead of the default. The following command downloads the page content into an ecommerce.html file in the same directory:
wget -O ecommerce.html https://www.scrapingcourse.com/ecommerce/
You can also save the named output file into a specific folder. The following command stores ecommerce.html in the output_folder directory:
wget -O ./output_folder/ecommerce.html https://www.scrapingcourse.com/ecommerce/
Download Multiple Files
Wget also lets you download content from multiple URLs consecutively.
Say you want to download the first two products on the Ecommerce demo page. You can pass both URLs to the wget command. The command below downloads the content of each URL into a separate HTML file:
wget https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/ https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/
What if you want to specify a download path for each URL? You can achieve that with the -O flag. However, you can't use multiple -O options in a single wget command.
To apply the -O flag to each URL without errors, chain separate commands with && like so:
wget -O output1.html https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/ && wget -O output2.html https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/
The above command saves each URL's content in output1.html and output2.html, respectively.
For a cleaner command, store the links in a text file (e.g., urls.txt) in your project directory and download them iteratively. For example, consider the following urls.txt file:
https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
https://www.scrapingcourse.com/ecommerce/product/aeon-capri/
The following command downloads the links in urls.txt into separate files:
wget -i urls.txt
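You can also combine -i with other flags covered earlier. For example, this sketch downloads every URL listed in urls.txt into the output_folder directory:
wget -P ./output_folder -i urls.txt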
Change User Agent
Your browser sends a data string known as the User Agent (UA) to the target website's server. This UA string contains information about your browser and operating system. Anti-bot technologies typically examine the UA to differentiate actual browsers from bots.
By customizing the default Wget User Agent with one that looks real, you can reduce the chance of getting blocked. Request the HTTPBin User Agent endpoint to see what Wget's default User Agent looks like:
wget -qO- https://httpbin.io/user-agent
You should get the following response:
{
"user-agent": "Wget/1.21.4"
}
The default value is Wget/1.21.4, which isn't a valid browser User Agent. Websites with anti-bot measures will easily flag you as a bot.
Here's what a valid browser (Chrome) User Agent looks like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
You can set the Wget User Agent by specifying the user_agent option in your .wgetrc configuration file or by using the --user-agent=user-agent-string flag in your request.
Grab a User Agent string from our list of top User Agents for web scraping.
To use the --user-agent flag directly via the terminal:
wget -qO- --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" https://httpbin.io/user-agent
The above command outputs the following:
{
"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
For convenience, you may prefer setting the User Agent inside the configuration file. Create a .wgetrc file in your project root directory and add the User Agent string to it:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
Run your request with the configuration file:
wget --config ./.wgetrc -qO- https://httpbin.io/user-agent
You'll get this output, indicating you've changed Wget's User Agent:
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
Extract Links From a Web Page
Wget also lets you extract links from a website. This feature is handy for crawling pages recursively during web scraping.
The following command extracts all relevant links from the Ecommerce demo page and stores them in a links.txt file:
wget --spider --recursive --output-file=links.txt https://www.scrapingcourse.com/ecommerce
The above command extracts all links on the website without applying any filters. Say you want to extract only the product links from this website. You can apply the --accept-regex flag with a pattern matching the website's product route before specifying the target URL:
wget --recursive --no-verbose --output-file=links.txt --accept-regex="https://www.scrapingcourse.com/ecommerce/product/" https://www.scrapingcourse.com/ecommerce/
The above command writes the product URLs (the links matching that route) to the links.txt file:
2024-09-10 16:12:38 URL:https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/ [120241] ->
"www.scrapingcourse.com/ecommerce/product/abominable-hoodie/index.html" [1]
2024-09-10 16:12:39 URL:https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/ [119210] ->
"www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/index.html" [1]
// ...
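The log lines above include timestamps and download sizes. If you only need the bare URLs, you can post-process links.txt with standard shell tools. Here's one possible approach (assuming GNU grep is available):
grep -oE "https://www\.scrapingcourse\.com/ecommerce/product/[^ ]+" links.txt | sort -u > product-urls.txt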
Rate Limit
It's recommended to slow down your request rate when executing many requests against a single domain. Rate limiting is essential to avoid getting blocked or overloading the server.
Wget provides two relevant options: wait (or waitretry) to set a delay between requests or retries, and limit_rate to cap your download rate. You can add these to your configuration file to apply the settings.
For example, add the following to your .wgetrc file to wait 10 seconds between retries and limit the download speed to 200 KB/s:
waitretry = 10
limit_rate = 200k
Run the wget command with the above configuration:
wget --config ./.wgetrc -qO- -i urls.txt
The above command requests the URLs inside urls.txt while applying the rate limit settings.
Alternatively, you can set the --waitretry and --limit-rate flags directly in your terminal command:
wget --waitretry=10 --limit-rate=200k -qO- -i urls.txt
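Note that --waitretry only applies a delay between retries of failed downloads. If you want a pause between every request, use --wait instead, optionally combined with --random-wait to vary the interval, for example:
wget --wait=5 --random-wait --limit-rate=200k -qO- -i urls.txt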
While there's no universal speed limit or request delay that ensures safety, you can adhere to some general guidelines and proven tips to avoid getting blocked when scraping.
Now that you know the fundamentals, let's show you how to set a proxy with Wget.
How to Use Wget With a Proxy From the Command Line
The first step to setting a Wget proxy is to get the proxy you want to use and set the proxy variables for HTTP and HTTPS in the .wgetrc file, which holds Wget's configuration.
We'll start by requesting the HTTPBin IP endpoint with Wget to see our machine's IP address:
wget -qO- https://httpbin.io/ip
You'll get an output similar to this:
{
"origin": "102.80.77.54:43854"
}
The value of the origin key is your IP address. Now, let's use a proxy and see if the output (IP address) will change. If you don't yet have a proxy, get one from the Free Proxy List.
Your proxy should be in this format:
<PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>
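For example, using a made-up placeholder address, an HTTP proxy on port 8080 would look like this:
http://203.0.113.10:8080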
To use a proxy, Wget checks whether the http_proxy, https_proxy, and ftp_proxy variables are set in:
- The default wgetrc configuration file.
- A custom configuration file passed via the --config flag.
- The proxy environment variables (see the sketch after this list).
- The terminal command, using the -e flag.
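For reference, the environment-variable approach (the third option) would look something like this in a Unix shell, using placeholder values:
export http_proxy=http://<PROXY_IP_ADDRESS>:<PROXY_PORT>
export https_proxy=http://<PROXY_IP_ADDRESS>:<PROXY_PORT>
wget -qO- https://httpbin.io/ip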
We'll explore option two: creating a configuration file and passing it to Wget.
A Wget configuration file has the syntax below:
variable = value
Create a configuration file (for example, .wgetrc) in your current directory and add the following to it:
use_proxy = on
http_proxy = http://15.229.24.5:10470
https_proxy = http://15.229.24.5:10470
ftp_proxy = http://15.229.24.5:10470
Remember that the above is a free proxy, which is unreliable due to its short lifespan. So, replace it with a fresh one and save the file.
Request HTTPBin, but this time, pass the configuration file like this:
wget --config ./.wgetrc -qO- https://httpbin.io/ip
This time, you should get your proxy's IP address.
{
"origin": "15.229.24.5"
}
Wget Proxy Authentication Required: Username and Password
Some proxy servers, such as premium services, require client authentication before granting access. In that case, your Wget proxy setup needs authentication options that specify a username and password when connecting to the proxy server.
You can achieve this by passing the --proxy-user and --proxy-password options to Wget, along with the HTTP and HTTPS proxy URLs.
wget --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> -e use_proxy=yes -e http_proxy=http://<PROXY_ADDRESS>:<PROXY_PORT> -e https_proxy=https://<PROXY_ADDRESS>:<PROXY_PORT> -qO- https://httpbin.io/ip
Alternatively, you can include your authentication credentials in your .wgetrc configuration file like so:
use_proxy = on
http_proxy = http://<PROXY_ADDRESS>:<PROXY_PORT>
https_proxy = https://<PROXY_ADDRESS>:<PROXY_PORT>
proxy_user = <YOUR_USERNAME>
proxy_password = <YOUR_PASSWORD>
Then, execute your request with the --config flag to use the .wgetrc proxy settings:
wget --config ./.wgetrc -qO- https://httpbin.io/ip
The above command will return your authenticated proxy IP address.
We've set proxies for the HTTP and HTTPS protocols. Let's quickly explain the most common proxy protocols below.
Best Protocols for a Wget Proxy
The most common proxy types are HTTP, HTTPS, and SOCKS5. HTTP proxies route standard, unencrypted web traffic, while HTTPS proxies provide encrypted communication for secure data transfer.
SOCKS5 is a more versatile proxy that handles various types of traffic, including non-HTTP protocols, making it suitable for tasks like file sharing or email traffic.
Overall, HTTP and HTTPS proxies are suitable for web scraping and crawling, while SOCKS finds applications in tasks involving non-HTTP traffic. Wget has built-in support for HTTP and HTTPS proxies but not SOCKS. Consider using cURL if you want to use a SOCKS proxy. To learn more, check out our article on web scraping with cURL.
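As a quick illustration, a SOCKS5 request with cURL might look like this (placeholder proxy address, assuming cURL is installed):
curl --socks5-hostname <PROXY_IP_ADDRESS>:<PROXY_PORT> https://httpbin.io/ip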
Use a Rotating Proxy With Wget
A rotating proxy is a proxy server that constantly changes IP addresses. With each request coming from a different IP, it becomes more difficult for websites to detect and block automated traffic.
Let's set up a rotating proxy using Wget.
Rotate IPs With a Free Solution
A free solution is to create a list of proxies with different IPs and randomly select one per request. Wget doesn't have a standard way of randomly selecting proxies, but you can use a simple shell script.
Grab a few free proxies from the Free Proxy List. Then, create a proxies.txt file and add all the proxies you intend to use, with each Wget proxy on a new line:
http://113.53.231.133:3129
http://15.229.24.5:10470
Create a random_proxies.sh shell script that uses GNU's shuf utility to randomly select a proxy from proxies.txt. Then, set the proxy in Wget using the -e execute option. We've added a loop to execute the Wget command three times to test if the IP changes:
#!/bin/bash
# loop 3 times to make 3 requests
for i in {1..3}
do
# pick a random proxy from proxies.txt
proxy=$(shuf -n 1 proxies.txt)
# output the selected proxy
echo "Using proxy: $proxy"
# run wget using the random proxy
wget --config ./.wgetrc -e use_proxy=yes -e http_proxy="$proxy" -e https_proxy="$proxy" -e ftp_proxy="$proxy" -qO- https://httpbin.io/ip
done
If you're using Windows, you'll need a Unix-like shell environment like Git Bash to run the script above. Once installed, open your project folder via Git Bash.
Run the shell script via your terminal:
./random_proxies.sh
You'll see that the proxy is picked at random for each request (the same IP may occasionally repeat):
{
"origin": "113.53.231.133"
}
{
"origin": "113.53.231.133"
}
{
"origin": "15.229.24.5"
}
You're now rotating free proxies using a Bash script with Wget.
However, this solution is unreliable because free proxies are shared among many users and are short-lived, making them prone to bans. Additionally, the manually created proxy list becomes difficult to maintain at scale, especially when some proxies in the pool start failing.
The best approach is to use residential auto-rotating proxies. We'll explain how to get them in the next section.
Premium Proxy to Avoid Getting Blocked
Premium residential proxies use network provider IPs belonging to daily internet users. Most providers rotate these residential proxies from a pool, ensuring you appear as a different user per request. This feature makes you less detectable, especially during large-scale scraping.
ZenRows is a top residential proxy provider offering quality proxy rotation and geo-targeting with flexible pricing. It provides 55 million globally distributed residential IPs covering 185+ countries.
To use ZenRows, sign up to open the Request Builder.
Go to the Proxy Generator and copy the proxy domain information and your proxy credentials (username and password). Paste this information into your .wgetrc file.
Your .wgetrc configuration file should contain your ZenRows proxy details, as shown:
use_proxy = on
http_proxy = http://superproxy.zenrows.com:1337
https_proxy = http://superproxy.zenrows.com:1338
proxy_user = <ZENROWS_PROXY_USERNAME>
proxy_password = <ZENROWS_PROXY_PASSWORD>
Now, run your Wget request with the above proxy configuration:
wget --config ./.wgetrc -qO- https://httpbin.io/ip
See the result for three consecutive requests:
# request 1
{
"origin": "95.49.7.50:45176",
}
# request 2
{
"origin": "189.188.232.238:49067"
}
# request 3
{
"origin": "109.196.227.119:55700"
}
Your Wget scraper now routes requests through ZenRows' residential proxies.
However, proxies might be insufficient when dealing with advanced anti-bot measures. The good news is that ZenRows also offers a web scraping API to bypass anti-bots at the same price as the proxy service. You can completely replace your Wget scraper with this scraping API and forget about getting blocked. Try the ZenRows scraper API for free.
How to Fix Common Errors
When web scraping with Wget, you may encounter several errors:
Error 407: Wget Proxy Authentication Required
Wget's Error 407 means your proxy server requires authentication, so providing valid credentials will fix this issue. The basic authentication credentials include a username and password. You can specify them directly in your command using the --proxy-user and --proxy-password options:
wget --proxy-user=<YOUR_USERNAME> --proxy-password=<YOUR_PASSWORD> -e use_proxy=yes -e http_proxy=http://<PROXY_ADDRESS>:<PROXY_PORT> -e https_proxy=https://<PROXY_ADDRESS>:<PROXY_PORT> -qO- <TARGET_URL>
Alternatively, you can set the proxy_user and proxy_password options in your configuration file to run Wget behind a proxy.
Replace <YOUR_USERNAME> and <YOUR_PASSWORD> with your proxy credentials, <PROXY_ADDRESS> and <PROXY_PORT> with your proxy server's address and port, and <TARGET_URL> with the URL you want to request. If the credentials are correct, you won't get this error again.
Error 400: Wget Proxy Bad Request
Wget's Error 400 usually means the request sent to your proxy server was malformed. You can fix it by verifying your Wget proxy server's settings, such as its address, port, and any other available configurations. The error may also appear if there are problems with the target server, or if the target is protected by anti-bot systems like Cloudflare, which might respond with errors such as Error 1020. To confirm the cause, try accessing the target directly without a proxy.
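For instance, you can temporarily bypass your proxy settings with the --no-proxy flag to check whether the target responds when accessed directly:
wget --no-proxy -qO- <TARGET_URL>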
Conclusion
When scraping the web, a Wget proxy can help bypass IP blocks and view content restricted in your country. You now know the basics of Wget, how to use a free or a premium proxy with it, and how to authenticate your proxies.
Considering free proxies' unreliability, premium services are the best option. ZenRows provides an effective residential rotating proxy service and features to avoid bot detection.
Sign up for free now without a credit card.