robots.txt is a file that websites use to tell web scrapers which pages they should and shouldn't crawl. You should respect those preferences: ignoring them makes your bot easy to detect and can even expose you to legal consequences.
Let's learn how to read robots.txt while web scraping!
What Is robots.txt in Web Scraping?
The Robots Exclusion Protocol (REP) established robots.txt as a standardized file for indicating which parts of a website may be crawled, and Google helped popularize it.
Here's an example of robots.txt taken from Yahoo Finance:
User-agent: *
Sitemap: https://finance.yahoo.com/sitemap_en-us_desktop_index.xml
Sitemap: https://finance.yahoo.com/sitemap_en-us_quotes_index.xml
Sitemap: https://finance.yahoo.com/sitemaps/finance-sitemap_index_US_en-US.xml.gz
Sitemap: https://finance.yahoo.com/sitemaps/finance-sitemap_googlenewsindex_US_en-US.xml.gz
Disallow: /r/
Disallow: /_finance_doubledown/
Disallow: /nel_ms/
Disallow: /caas/
Disallow: /__rapidworker-1.2.js
Disallow: /__blank
Disallow: /_td_api
Disallow: /_remote
User-agent: googlebot
Disallow: /m/
Disallow: /screener/insider/
Disallow: /caas/
Disallow: /fin_ms/
User-agent: googlebot-news
Disallow: /m/
Disallow: /screener/insider/
Disallow: /caas/
Disallow: /fin_ms/
How to Get the robots.txt File From a Website?
You typically retrieve a website's robots.txt by sending an HTTP request to the root of the website's domain and appending /robots.txt to the end of the URL. For example, to retrieve the rules for https://www.g2.com/, you'll need to send a request to https://www.g2.com/robots.txt.
You can use tools like cURL or Wget to get the file from the command line. Alternatively, you can fetch and read it in Python with the Requests and Beautiful Soup libraries.
Remark: If your request fails and returns a 404 Not Found error, it means the website doesn't have a robots.txt file. Not all sites publish one.
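As a quick illustration, here's a minimal Python sketch using the Requests library to download a robots.txt file and handle the missing-file case (the G2 URL from the example above is just an illustrative target):
import requests

# Fetch the robots.txt for a domain; G2 is the example target used above
url = "https://www.g2.com/robots.txt"
response = requests.get(url, timeout=10)

if response.status_code == 404:
    print("This site doesn't publish a robots.txt file")
else:
    print(response.text)  # the raw rules, ready to be parsed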
What Are the Most Common robots.txt Rules?
The robots.txt file gives web scrapers one of the following directions:
- All pages on the site are crawlable.
- No pages should be visited.
- Certain sections or files should be left untouched.
It can also specify crawl rate limits, visit times, and request rates.
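For instance, the following hypothetical rules (not taken from any real site) keep every crawler out of a single folder while leaving the rest of the site open:
User-agent: *
Disallow: /private/
Changing the last line to Disallow: / would put the entire site off limits, while an empty Disallow: would open everything up.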
Let's see what instructions you'll find in a robots.txt file.
User-agent
This directive specifies which crawlers the rules that follow apply to.
The syntax is this:
User-agent: [value]
Disallow: [value]
If User-agent has a wildcard (*), the rules that follow apply to every crawler. If it names a specific bot, such as AdsBot-Google, the rules apply only to that crawler (Google's ads bot in this case).
When Disallow has no value, all pages are open for scraping. If you see /, every single page is off limits. A path or file name, such as /folder/ or /file.html, points out exactly what shouldn't be crawled.
The counterpart of Disallow is Allow, which explicitly permits specific resources, typically as an exception to a broader Disallow rule.
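You don't have to interpret these rules by hand. Here's a minimal sketch using Python's built-in urllib.robotparser module, with G2 reused as the example target and purely illustrative paths:
from urllib.robotparser import RobotFileParser

# Point the parser at the target site's robots.txt and download it
parser = RobotFileParser("https://www.g2.com/robots.txt")
parser.read()

# can_fetch() applies the User-agent, Disallow and Allow rules for you;
# the paths below are only placeholders for this example
print(parser.can_fetch("*", "https://www.g2.com/"))
print(parser.can_fetch("*", "https://www.g2.com/some/path"))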
Crawl-delay
Crawl-delay sets how many seconds a crawler must wait before requesting each new resource. It helps websites prevent server overload, which would slow the site down for human visitors.
Crawl-delay: 7
Be careful with this one since not following it might flag you as a malicious scraper and get you blocked easily.
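In practice, honoring the delay can be as simple as sleeping between requests. A rough sketch (the URLs are placeholders and the 7-second value comes from the example above):
import time
import requests

CRAWL_DELAY = 7  # value from the Crawl-delay example above
urls = ["https://example.com/page-1", "https://example.com/page-2"]  # placeholders

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(CRAWL_DELAY)  # wait before fetching the next resource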
Visit-time
It specifies the hours during which a website can be crawled. The format is hhmm-hhmm, and the time zone is UTC.
Visit-time: 0200-1230
In this case, bots are allowed from 02:00 to 12:30 UTC.
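Standard robots.txt parsers generally don't expose this non-standard rule, so honoring it may require a manual check. A small sketch, assuming the 0200-1230 window from the example above:
from datetime import datetime, timezone

START, END = "0200", "1230"  # window from the Visit-time example above
now = datetime.now(timezone.utc).strftime("%H%M")

if START <= now <= END:
    print("Inside the allowed visit window, OK to crawl")
else:
    print("Outside the allowed visit window, hold off")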
Request-rate
It limits how many requests a crawler can make to a website in a given period. The format is x/y, with x being the number of requests and y being the time interval in seconds.
Request-rate: 1/5
For example, 1/5 means you can only request one page every five seconds.
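Python's urllib.robotparser can read this directive for you (along with Crawl-delay). A small sketch, reusing the G2 URL from earlier as an illustrative target:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.g2.com/robots.txt")
parser.read()

# request_rate() returns a named tuple (requests, seconds), or None if the
# site doesn't define the rule; crawl_delay() behaves the same way
rate = parser.request_rate("*")
if rate:
    print(f"{rate.requests} request(s) allowed every {rate.seconds} second(s)")
else:
    print("No Request-rate rule for this user agent")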
Sitemap
Other directives, such as Sitemap, tell crawlers where to find the website's XML sitemap, as in the Yahoo Finance example above.
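If you want to use the sitemap as a list of URLs to scrape, you can fetch it and pull out the <loc> entries. A minimal sketch, assuming an uncompressed sitemap like the first Yahoo Finance URL above:
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://finance.yahoo.com/sitemap_en-us_desktop_index.xml"
response = requests.get(SITEMAP_URL, timeout=10)
root = ET.fromstring(response.content)

# Sitemaps use the sitemaps.org namespace; every <loc> element holds a URL
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)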
Keep in mind that not all websites have all these rules in their robots.txt for web scraping bots, and some may have additional ones.
What Are the Steps to Scrape a Website Using robots.txt?
Here's what you need to do to respect the robots.txt file for web scraping:
- Retrieve the website's robots.txt by sending an HTTP request to the root of the website's domain and adding /robots.txt to the end of the URL.
- Parse and analyze the contents of the file to understand the website's crawling rules.
- Check if the website has specified any "Disallow" or "Allow" rules for your user agent.
- Look for any specified crawl-rate limits or visit times that you must abide by.
- Ensure your scraping program adheres to the rules.
- Scrape the website following the rules set in the robots.txt file.
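Here's a rough end-to-end sketch that ties these steps together with Python's Requests and urllib.robotparser; the domain, user agent name, and paths are all placeholders:
import time
import requests
from urllib.robotparser import RobotFileParser

BASE = "https://example.com"  # placeholder target
USER_AGENT = "my-scraper"     # hypothetical bot name

# Steps 1-2: retrieve and parse robots.txt
parser = RobotFileParser(f"{BASE}/robots.txt")
parser.read()

# Steps 3-4: read the rules that apply to our user agent
delay = parser.crawl_delay(USER_AGENT) or 1  # fall back to a polite default

# Steps 5-6: scrape only what the rules allow, at the allowed pace
for path in ["/", "/blog/", "/admin/"]:  # illustrative paths
    url = f"{BASE}{path}"
    if parser.can_fetch(USER_AGENT, url):
        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        print(url, response.status_code)
        time.sleep(delay)
    else:
        print(f"Skipping {url}: disallowed by robots.txt")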
Remark: While website owners use robots.txt to control access to their sites, your bot may still get blocked even on pages it's allowed to visit. Be aware of CAPTCHAs, IP blocking, and other challenges that might stop you unintentionally. To avoid this, check out our article on the best practices while web scraping.
What Are the Pros and Cons of Using a robots.txt File?
To round up our overview of robots.txt files, we'll go over their advantages and drawbacks regarding web scraping.
👍 Pros:
- robots.txt informs you which pages you can scrape.
- It lets you know if a request rate limit or time frame is set by the website.
👎 Cons:
- Legal actions may follow if you don't comply with the robots.txt rules.
- Your scraper might get easily blocked if you ignore the file.
Conclusion
As we've seen, reading robots.txt is key to successful web scraping and to avoiding unnecessary problems. You now know how to retrieve the file and make sense of its rules.
If you still get blocked, you might be facing anti-bot protections. ZenRows is an API you can try for free that will deal with them for you.