Headless browser scraping is one of the best methods for crawling data from web pages. Traditional web scraping requires you to run your code inside a full browser, which makes the process inconvenient since you need an environment that provides a graphical interface.
The browser also needs time and resources to render the actual web page you're trying to scrape, which slows things down. If your project only involves basic data extraction, you might get away with basic methods, but for anything heavier, this is where headless browser web scraping comes in.
In this guide, we'll be discussing what headless browsers are, the benefits of headless scraping and the best options available.
Let's get started!
What Is Headless Browser Scraping?
Headless browser scraping is web scraping performed with a headless browser, meaning you scrape a web page without an actual user interface. For example, this is what happens when you scrape through a normal web browser:
So what if you use a headless crawler instead? You literally skip the rendering step:
While the exact results differ depending on the headless browser you're using, at a high level that's what happens when you go headless.
You can use most programming languages for this, such as Node.js, PHP, Java, Ruby and Python. The only requirement for any of them is that there is at least one library or package that lets you interact with a headless browser.
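For instance, here's a minimal sketch in Node.js using the Puppeteer package (the require is kept inside the function so the snippet loads even where Puppeteer isn't installed, and the target URL is just a stand-in):

```javascript
// Minimal headless scrape: launch a browser, load a page, grab its title.
async function scrapeTitle(url) {
  const puppeteer = require('puppeteer'); // deferred require, see note above
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const title = await page.title();
  await browser.close();
  return normalizeWhitespace(title);
}

// Collapse runs of whitespace so scraped text is easier to compare.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Usage: scrapeTitle('https://example.com').then(console.log);
```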
Is Headless Scraping Faster?
Yes, it is, because it requires fewer resources and fewer steps to get the information needed.
When you use a headless browser, you're skipping the entire rendering of the UI.
We can use Puppeteer (an automation tool that drives a Chromium-based browser) to check the Performance tab in DevTools and compare a page load configured to skip all images and CSS styles against a normal page load.
We'll use the eBay site, which relies heavily on images, making it a perfect candidate for this optimization.
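The image- and CSS-blocking setup can be sketched with Puppeteer's request interception (assuming the puppeteer package is installed; the require is deferred so the small helper stays usable on its own):

```javascript
// Resource types we skip to speed up the page load.
const BLOCKED_TYPES = new Set(['image', 'stylesheet']);

function isBlockedResource(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Load a page with images and CSS blocked.
async function loadWithoutAssets(url) {
  const puppeteer = require('puppeteer'); // deferred require
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (isBlockedResource(request.resourceType())) {
      request.abort(); // skip heavy assets entirely
    } else {
      request.continue();
    }
  });
  await page.goto(url, { waitUntil: 'networkidle2' });
  const html = await page.content();
  await browser.close();
  return html;
}
```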
Look at that! 2 seconds off when we load the page without images and CSS styles.
The time spent painting the page is also lower because, while we still have to show something, that "something" is a lot less complex.
Think about a more realistic scenario: say you have 100 clients, and each of them makes 100 scraping requests per day. That's 10,000 requests per day. Saving 2 seconds on average per request adds up to 20,000 seconds, or roughly 5.5 hours, saved every day, just by not rendering all those resources.
Is that number big enough for you now?
Can Headless Browsers Be Detected?
Just because you can scrape a website using the latest technologies doesn't mean you should. Web scraping is sometimes seen as abusive, and some developers go the extra mile to block crawlers from their web content.
That said, headless browsers can be detected, and here are some techniques developers use to spot headless scraping activity:
1. Request frequency
The request frequency is a clear indicator. This takes us back to the previous point about performance being a double-edged sword.
While it's great to be able to send many requests at once, a website that doesn't want to be scraped will quickly notice that a single IP is sending too many requests per second and block them in a split second.
So what can you do to avoid it? If you're coding your own scraper, you can throttle the number of requests you send per second, especially if you're sending them all through the same IP. That way you can simulate the behavior of a real user.
How many requests can you send per second? That's up to the website to limit and you to find out through trial and error.
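A minimal throttling sketch in Node.js, with the delay value as an assumption you'd tune per target:

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run tasks one at a time with a fixed pause between them, so
// requests coming from a single IP stay under the radar.
async function throttled(tasks, delayMs) {
  const results = [];
  for (const task of tasks) {
    results.push(await task());
    await sleep(delayMs);
  }
  return results;
}

// Usage sketch (fetchPage is a hypothetical scraping function):
// const pages = await throttled(urls.map((u) => () => fetchPage(u)), 1000);
```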
2. IP filtering
Another very common way developers determine whether you're a real user or a bot trying to scrape their website is to keep an updated blacklist of IPs they can't trust, usually because they've detected scraping activity originating from them in the past.
Bypassing IP filtering shouldn't be a problem if you rotate your IPs; ZenRows, for example, offers excellent premium proxies for web scraping.
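Routing a headless Chromium through a proxy is typically done with the `--proxy-server` launch flag; a sketch (the proxy URL is a placeholder, not a real endpoint):

```javascript
// Build Chromium launch options that route all traffic through a proxy.
function launchOptionsWithProxy(proxyUrl) {
  return { args: [`--proxy-server=${proxyUrl}`] };
}

// Usage sketch (requires puppeteer):
// const browser = await puppeteer.launch(
//   launchOptionsWithProxy('http://proxy.example.com:8080')
// );
```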
3. CAPTCHAs
CAPTCHAs are used by developers to filter out bots. These little tests pose a simple problem that humans can solve easily but that would take a machine some real work. While a CAPTCHA can also be solved by a computer, it forces you to work quite a bit harder.
There are a number of ways to bypass CAPTCHAs but one of the easiest we have found is by making use of ZenRows' API.
4. User Agent detection
All browsers send a special header called "User-Agent" where they add an identifying string that determines which browser is used and its version. If a website is trying to protect itself from crawlers and scrapers, it will look for that header and check if the value is correct or not.
For example, this would be the user agent sent by a normal Google Chrome browser: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
You can see the string specifies the browser, its version and even the OS that it's using.
Now look at a normal Headless Chrome user agent (Headless Chrome is a headless version of Chrome browser):
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/76.0.3803.0 Safari/537.36
You can see similar information, this time telling us it's on Linux instead of Windows, but it also includes the HeadlessChrome token and version. That's an immediate sign that we're not real human beings. Busted!
Of course, as with most request headers, we can fake it through whatever tool we're using to scrape the website. But if you forget to, you'll eventually run into a target that filters by User-Agent and you'll get blocked.
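With Puppeteer you can fetch the default User-Agent and strip the telltale token before scraping; `maskHeadlessUA` is a helper name of our own:

```javascript
// Replace the "HeadlessChrome" token with a regular "Chrome" token
// so the User-Agent header no longer gives the headless browser away.
function maskHeadlessUA(userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome');
}

// Usage sketch with puppeteer:
// const ua = await browser.userAgent();
// await page.setUserAgent(maskHeadlessUA(ua));
```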
5. Browser fingerprinting
Browser fingerprinting is a technique that involves gathering different characteristics of your system and making a unique hash out of them all.
There is no single way of doing this but, done correctly, these techniques can identify you and your browser even if you try to mask your identity. One common technique is canvas fingerprinting, where a script draws an image in the background and measures the distinct distortions caused by your particular setup (graphics card, browser version, etc.).
Another is reading the list of media devices installed on your system; that unique combination will give your identity away in future sessions. Others even use sound to build a fingerprint, for example using the Web Audio API to measure the specific distortions in a sound wave generated by an oscillator.
After all, if you're trying to scrape a website that has many of these protections in place, you're probably serious about it. If that's your case, consider investing some money and letting the pros handle all these problems for you.
Which Browser Is Best for Web Scraping?
Is there a "best headless browser for scraping" out there? No, there isn't. After all, the concept of "best" is only valid within the context of the problem you're trying to solve.
That being said, there are some popular alternatives out there that might be right for you, at least as a starting point. Some of the most popular browsers for headless web scraping are:
1. ZenRows
ZenRows is an API with an integrated headless browser out of the box. It allows both static and dynamic data scraping with a single API call.
You can integrate with all languages, and it also offers SDKs for Python and Node.js. For testing, you can start with a graphical interface or cURL requests. Then scale with your preferred language.
2. Puppeteer
Puppeteer is essentially a headless browser for Node.js. It has a great API and is relatively easy to use; with a few lines of code you can start your scraping activities.
It's widely used for headless testing, and it provides a very intuitive API for specifying user actions on the scraping target.
3. HTMLUnit
HTMLUnit is a great option for Java developers. This headless browser is used by many popular projects and is actively maintained.
What Are the Downsides of Headless Browsers for Web Scraping?
If you're web scraping using a headless browser for fun, then chances are nobody will care about your scraping activities. But if you're doing large-scale web scraping, you'll start making enough noise and risk getting detected. The following are some of the downsides of web scraping with a headless browser:
Headless browsers are harder to debug
Scraping is basically extracting data from a source, and to achieve this we have to reference parts of the DOM, like capturing a certain class name or looking for an input with a specific name.
When the website's structure changes (i.e., its HTML changes), our code is rendered useless, because our logic can't adapt the way we would through visual inspection.
This is all to say that if you're building a scraper, and data suddenly starts being wrong or empty (because the HTML of the scraped page changed), you'll have to manually review and debug the code to understand what's happening.
So you have to resort either to workarounds or to a browser that supports both modes, like Puppeteer, so you can switch the UI on and off when you need it.
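One convenient pattern is to gate the headless flag behind an environment variable, so the very same scraper can be run with a visible UI when you need to debug it; a sketch (the variable name is our own):

```javascript
// Toggle the UI via an environment variable:
//   DEBUG_BROWSER=1 node scraper.js   -> visible browser window
//   node scraper.js                   -> headless (default)
function launchOptions(env = process.env) {
  return { headless: env.DEBUG_BROWSER !== '1' };
}

// Usage sketch (requires puppeteer):
// const browser = await puppeteer.launch(launchOptions());
```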
They do have a significant learning curve
Browsing a website through code requires you to see the website differently. We're used to browsing a website using visual cues or descriptions (in the case of visual aid assistants).
But now you have to look at the website's code and understand its architecture to properly get the information you want, in a way that is mostly change-resistant to avoid updating the browsing logic every few weeks.
What Are the Benefits of Headless Scraping?
The benefits of using a headless browser for scraping include:
1. Task automation
It's possible to automate tasks during headless browser scraping, making it a real time-saver, especially when the website you're trying to crawl isn't interested in protecting itself or doesn't change its internal architecture too often. That way you won't be bothered with constant updates to the browsing logic.
2. Increase in speed
The increase in speed is considerable since you'll utilize fewer resources per website and the loading time for each one can also be reduced, making it a huge time-saver over time.
3. Delivery of structured data from a seemingly unstructured source
Websites can seem unstructured because, after all, they're designed to be read by people, and people have no problem with gathering information from unstructured sources. But because websites do have an internal architecture, we can leverage it and take the information we want.
We can then save that information in a machine-readable form (like JSON, YAML or any other data storage format) for later processing.
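For example, fields scraped as loose strings can be normalized into a JSON-ready record (the field names here are illustrative):

```javascript
// Turn loosely scraped strings into a clean, machine-readable record.
function toRecord(title, price, url) {
  return {
    title: title.trim(),
    // Strip currency symbols and thousands separators before parsing.
    price: Number.parseFloat(price.replace(/[^0-9.]/g, '')),
    url,
    scrapedAt: new Date().toISOString(),
  };
}

// Usage sketch:
// JSON.stringify(toRecord(' Widget ', '$19.99', 'https://example.com/w'))
```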
4. Potential savings in bandwidth fees
Headless browsing can be done in a way that skips some of the heaviest resources (in kilobytes) of a web page. That translates directly into a speed increase, but it also saves a lot of data from being transferred to the server where the scraping takes place.
And that can mean a considerably lower data-transfer bill from services like proxies or gateways from cloud providers that charge per transferred byte.
5. Scraping dynamic content
Several of the above also apply to faster tools like requests in Python. But those lack an important feature that headless browsers have: extracting data from dynamic pages or SPAs (Single Page Applications).
You can wait for content to appear and even interact with the page: navigate, fill in forms, the options are almost unlimited. After all, you're interacting with a real browser.
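A sketch of scraping a dynamic page with Puppeteer: wait for the client-side render to finish, then read the DOM (the selector is a placeholder, and the require is deferred so the timeout helper stays standalone):

```javascript
// Reject if a promise takes longer than `ms` — useful around waits
// on dynamic content that may never appear.
function withTimeout(promise, ms) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  );
  return Promise.race([promise, timeout]);
}

async function scrapeDynamic(url) {
  const puppeteer = require('puppeteer'); // deferred require
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Wait for client-side rendering to put the content in the DOM.
  await page.waitForSelector('.product-list', { timeout: 10000 });
  const items = await page.$$eval('.product-list li', (els) =>
    els.map((el) => el.textContent.trim())
  );
  await browser.close();
  return items;
}

// Usage sketch: const items = await withTimeout(scrapeDynamic(url), 30000);
```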
Conclusion
Web scraping in general is a great tool for capturing data from multiple web sources and centralizing it wherever you need it. It's a cost-effective way of crawling because you can do the work once and automate subsequent runs at a very low price.
Headless web scraping is a way to perform scraping with a special version of a browser that has no UI, which makes it even faster and cheaper to run.
In this guide, you've learned the basics of headless browser web scraping, including the types, benefits, downsides and some tools.
Or you could use an all-in-one API-based solution such as ZenRows for smooth web scraping. It provides access to premium proxies, anti-bot protection and CAPTCHA bypass systems. Try ZenRows for free.