How to Read robots.txt for Web Scraping
robots.txt is a file that websites use to tell web scrapers which pages they may or may not crawl. You should respect those preferences: ignoring them makes your bot easy to detect and, in some cases, can expose you to legal consequences.
Let's learn how to read robots.txt while web scraping!
What Is robots.txt in Web Scraping?
The Robots Exclusion Protocol (REP) established robots.txt as a standardized file indicating which parts of a website are open to crawling; Google later popularized it and pushed for its formal standardization.
Here's an example of robots.txt taken from Yahoo Finance:
```
User-agent: *
Sitemap: https://finance.yahoo.com/sitemap_en-us_desktop_index.xml
Sitemap: https://finance.yahoo.com/sitemap_en-us_quotes_index.xml
Sitemap: https://finance.yahoo.com/sitemaps/finance-sitemap_index_US_en-US.xml.gz
Sitemap: https://finance.yahoo.com/sitemaps/finance-sitemap_googlenewsindex_US_en-US.xml.gz
Disallow: /r/
Disallow: /_finance_doubledown/
Disallow: /nel_ms/
Disallow: /caas/
Disallow: /__rapidworker-1.2.js
Disallow: /__blank
Disallow: /_td_api
Disallow: /_remote

User-agent: googlebot
Disallow: /m/
Disallow: /screener/insider/
Disallow: /caas/
Disallow: /fin_ms/

User-agent: googlebot-news
Disallow: /m/
Disallow: /screener/insider/
Disallow: /caas/
Disallow: /fin_ms/
```
How to Get the robots.txt File From a Website?
You typically retrieve a website's robots.txt by sending an HTTP request to the root of the website's domain and appending /robots.txt to the end of the URL. For example, to retrieve the rules for https://www.g2.com/, you'll need to send a request to https://www.g2.com/robots.txt.
You can use tools like cURL or Wget to get the file from the command line. Alternatively, you can fetch and read it in Python with libraries such as Requests and Beautiful Soup.
Remark: If your request fails with a 404 Not Found error, the website doesn't have a robots.txt file. Not all sites have one.
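As a minimal standard-library sketch of this, the snippet below derives the robots.txt URL from any page URL and downloads it; the helper names and the example G2 URL are just illustrative, not part of any official API:

```python
from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen
from urllib.error import HTTPError

def robots_txt_url(page_url):
    """Derive the robots.txt URL from any page URL on the same site."""
    parts = urlparse(page_url)
    return urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))

def fetch_robots_txt(page_url):
    """Download the site's robots.txt, or return None if it doesn't exist."""
    try:
        with urlopen(robots_txt_url(page_url), timeout=10) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as err:
        if err.code == 404:  # the site has no robots.txt file
            return None
        raise

print(robots_txt_url("https://www.g2.com/products/asana"))
# -> https://www.g2.com/robots.txt
```

Calling fetch_robots_txt("https://www.g2.com/") would then return the raw rules as text, ready to be parsed.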
What Are the Most Common robots.txt Rules?
The robots.txt file indicates one of the following directions for web scraping:
- All pages on the site are crawlable.
- None should be visited.
- Certain sections or files should be left untouched.

The file can also specify crawl-rate limits, visit times, and request rates.
Let's see what instructions you'll find in a robots.txt file.
The User-agent and Disallow directives determine who's allowed to do web scraping and what they may scrape.
The syntax is this:
```
User-agent: [value]
Disallow: [value]
```
If User-agent has a wildcard (*), the rules in that group apply to every crawler. If it names a specific bot, such as AdsBot-Google, the rules apply only to that crawler (Google's ads bot, in this case).
If Disallow has no value, all pages are allowed for scraping. If its value is /, every single page is off-limits. And if it lists a path or file name, such as /file.html, that's what shouldn't be crawled.
There's also an alternative directive, Allow, which explicitly lists the resources you may visit, usually as exceptions to a broader Disallow rule.
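Python's standard urllib.robotparser module can evaluate these rules for you. Here's a small sketch with a made-up rule set; note that this parser applies rules in order of appearance, so the narrower Allow is listed before the broader Disallow:

```python
from urllib import robotparser

# A hypothetical robots.txt body for illustration
rules = """\
User-agent: *
Allow: /private/overview.html
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/home.html"))              # True: no rule matches
print(rp.can_fetch("*", "https://example.com/private/data.html"))      # False: Disallow /private/
print(rp.can_fetch("*", "https://example.com/private/overview.html"))  # True: explicit Allow
```

In a real scraper, you'd replace the hardcoded rules with the downloaded robots.txt and pass your own user agent string to can_fetch.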
Crawl-delay sets the number of seconds you must wait between requests for new resources. This helps websites prevent server overload, which would slow the site down for human visitors.
Be careful with this one since not following it might flag you as a malicious scraper and get you blocked easily.
Visit-time specifies the hours during which a website can be crawled. The format is hhmm-hhmm, and the time zone is UTC. For example, Visit-time: 0200-1230 means bots are allowed from 02:00 to 12:30 UTC.
Request-rate limits the number of simultaneous requests a crawler can make to a website. The format is x/y, with x being the number of requests and y the time interval in seconds. For example, 1/5 means you can request only one page every five seconds.
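Both of these limits can be read programmatically with urllib.robotparser; a minimal sketch, again using hypothetical rules:

```python
from urllib import robotparser

rules = """\
User-agent: *
Crawl-delay: 10
Request-rate: 1/5
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.crawl_delay("*"))    # 10 -> wait at least 10 seconds between fetches
rate = rp.request_rate("*")   # a named tuple: RequestRate(requests=1, seconds=5)
print(rate.requests, rate.seconds)
```

Both methods return None when the directive is absent, so check for that before using the values.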
Other tags, such as Sitemap, tell the crawlers where to find the website's XML sitemap.
Keep in mind that not all websites have all these rules in their robots.txt for web scraping bots, and some may have additional ones.
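If you're on Python 3.8 or later, urllib.robotparser also exposes any Sitemap entries via site_maps(); a quick sketch with an invented rule set:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```

The method returns None when the file declares no sitemaps.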
What Are the Steps to Web Scraping a Website Using robots.txt?
Here's what you need to do to respect the robots.txt file for web scraping:
- Retrieve the website's robots.txt by sending an HTTP request to the root of the website's domain and adding /robots.txt to the end of the URL.
- Parse and analyze the contents of the file to understand the website's crawling rules.
- Check if the website has specified any "Disallow" or "Allow" rules for your user agent.
- Look for any specified crawl-rate limits or visit times that you must abide by.
- Ensure your scraping program adheres to the rules.
- Scrape the website following the rules set in the robots.txt file.
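The steps above can be tied together in one short sketch. This is an offline illustration: the robots.txt body is hardcoded here, whereas in a real run you'd download it first (step 1), and the agent name and URLs are hypothetical:

```python
import time
from urllib import robotparser

# Step 1 (simulated): in practice, download https://example.com/robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 1
Disallow: /admin/
"""

AGENT = "my-scraper"  # hypothetical bot name

# Step 2: parse the file
rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

# Step 4: look up the crawl-rate limit to abide by
delay = rp.crawl_delay(AGENT) or 0

for path in ("/products", "/admin/users"):
    url = "https://example.com" + path
    # Step 3: check the Disallow/Allow rules for our user agent
    if rp.can_fetch(AGENT, url):
        print("scraping", url)  # Steps 5-6: fetch the page here, within the rules
        time.sleep(delay)       # respect the delay between requests
    else:
        print("skipping", url)
```

With these rules, /products gets scraped with a one-second pause, while /admin/users is skipped.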
Remark: While website owners use robots.txt to control access to their sites, your bot may be allowed by the rules and still get blocked. Be aware of CAPTCHAs, IP blocking, and other challenges that might stop you unintentionally. To avoid them, check out our article on best practices for web scraping.
What Are the Pros and Cons of Using a robots.txt File?
To round up our overview of robots.txt files, we'll go over their advantages and drawbacks regarding web scraping.
Pros:

- robots.txt informs you which pages you can scrape.
- It lets you know if a request rate limit or time frame is set by the website.

Cons:

- Legal actions may follow if you don't comply with the robots.txt rules.
- Your scraper might get easily blocked if you ignore the file.
As we've seen, reading robots.txt is key to successful web scraping and to avoiding unnecessary problems, and you now know how to interpret the file.
If you still get blocked, you might be facing anti-bot protections. ZenRows is an API you can try for free that will deal with them for you.