Avoiding blocks is an essential part of website scraping, so we will also add some features to help in that regard. And finally, parallelize the tasks to go faster thanks to Node's event loop.
npm install axios cheerio playwright
We are using Node v12, but you can always check the compatibility of each feature.
Axios is a "promise based HTTP client" that we will use to get the HTML from a URL. It allows several options such as headers and proxies, which we will cover later. If you use TypeScript, they "include TypeScript definitions and a type guard for Axios errors."
Cheerio is a "fast, flexible & lean implementation of core jQuery." It lets us find nodes with selectors, get text or attributes, and many other things. We will pass the HTML to cheerio and then query it as we would in a browser environment.
Scraping the Basics
The first thing we need is the HTML. We installed Axios for that, and its usage is straightforward. We'll use scrapeme.live as an example, a fake website prepared for scraping.
Nice! Then, using cheerio, we can query for the two things we want right now: paginator links and products. To know how to do that, we will look at the page with Chrome DevTools open. All modern browsers offer developer tools such as these. Pick your favorite.
We marked the interesting parts in red, but you can go on your own and try it yourselves. In this case, all the CSS selectors are straightforward and do not need nesting. Check the guide if you are looking for a different outcome or cannot select it. You can also use DevTools to get the selector.
On the Elements tab, right-click on the node ➡ Copy ➡ Copy selector. But the outcome is usually very coupled to the HTML, as in this case:
#main > div:nth-child(2) > nav > ul > li:nth-child(2) > a. This approach might be a problem in the future because it will stop working after any minimal change. Besides, it will only capture one of the pagination links, not all of them.
We could capture all the links on the page and then filter them by content. If we were to write a full-site crawler, that would be the right approach. In our case, we only want the pagination links. Using the provided class,
.page-numbers a will capture all and then extract the URLs (
hrefs) from those. The selector will match all the link nodes with an ancestor containing the class
As for the products (Pokémon in this case), we will get id, name, and price. Check the image below for details on selectors, or try again on your own. We will only log the content for now. Check the final code for adding them to an array.
As you can see above, all the products contain the class
product, which makes our job easier. And for each of them, the
h2 tag and
price node hold the content we want. As for the product ID, we need to match an attribute instead of a class or node type. That can be done using the syntax
node[attribute="value"]. We are looking only for the node with the attribute, so there is no need to match it to any particular value.
There is no error handling, as you can see above. We will omit it for brevity in the snippets but take it into account in real life. Most of the time, returning the default value (i.e...., empty array) should do the trick.
Now that we have some pagination links, we should also visit them. If you run the whole code, you'll see that they appear twice - there are two pagination bars.
We will add two sets to keep track of what we already visited and the newly discovered links. We are using sets instead of arrays to avoid dealing with duplicates, but either one would work. To avoid crawling too much, we'll also include a maximum.
For the next part, we will use async/await to avoid callbacks and nesting. An async function is an alternative to writing promise-based functions as chains. In this case, the Axios call will remain asynchronous. It might take around 1 second per page, but we write the code sequentially, with no need for callbacks.
There is a small gotcha with this:
await is only valid in async function. That will force us to wrap the initial code inside a function, concretely in an IIFE (Immediately Invoked Function Expression). The syntax is a bit weird. It creates a function and then calls it immediately.
As said before, we need mechanisms to avoid blocks, captchas, login walls, and several other defensive techniques. It is complicated to prevent them 100% of the time. But we can achieve a high success rate with simple efforts. We will apply two tactics: adding proxies and full-set headers.
There are Free Proxies even though we do not recommend them. They might work for testing but are not reliable. We can use some of those for testing, as we'll see in some examples.
Note that these free proxies might not work for you. They are short-time lived.
Paid proxy services, on the other hand, offer IP Rotation. Meaning that our service will work the same, but the target website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.
We will use httpbin for testing. It offers several endpoints that will respond with headers, IP addresses, and many more.
The next step would be to check our request headers. The most known one is User-Agent (UA for short), but there are many more. Many software tools have their own, for example, Axios (
axios/0.21.1). In general, it is a good practice to send actual headers along with the UA. That means we need a real-world set of headers because not all browsers and versions use the same ones. We include two in the snippet: Chrome 92 and Firefox 90 in a Linux machine.
Until now, every page visited was done using
This approach comes with its own problem: take a look a the User-Agents. The Chromium one includes "HeadlessChrome," which will tell the target website, well, that it is a headless browser. They might act upon that.
As with Axios, we can provide extra headers, proxies, and many other options to customize every request. An excellent choice to hide our "HeadlessChrome" User-Agent. And since this is a real browser, we can intercept requests, block others (like CSS files or images), take screenshots or videos, and more.
Now we can separate getting the HTML in a couple of functions, one using Playwright and the other Axios. We would then need a way to select which one is appropriate for the case at hand. For now, it is hardcoded. The output, by the way, is the same but quite faster when using Axios.
We already introduced async/await when crawling several links sequentially. If we were to crawl them in parallel, just by removing the
await would be enough, right? Well... not so fast.
The function would call the first
crawl and immediately take the following item from the
toVisit set. The problem is that the set is empty since the crawling of the first page didn't occur yet. So we added no new links to the list. The function keeps running in the background, but we already exited from the main one.
To do this properly, we need to create a queue that will execute tasks when available. To avoid many requests at the same time, we will limit its concurrency.
If you run the code above, it will print numbers from 0 to 3 almost immediately (with a timestamp) and from 4 to 7 after 2 seconds. It might be the hardest snippet to understand - review it without hurries.
queue in lines 1-20. It will return an object with the function
enqueue to add a task to the list. Then it checks if we are above the concurrency limit. If we are not, it will sum one to
running and enter a loop that gets a task and runs it with the provided params. Until the task list is empty, then subtract one from
running. This variable is the one that marks when we can or cannot execute any more tasks, only allowing it below the concurrency limit. In lines 23-28, there are helper functions
printer. Instantiate the queue in line 30 and enqueue items in 32-34 (which will start running 4).
We have to use the queue now instead of a for loop to run several pages concurrently. The code below is partial with the parts that change.
Remember that Node runs in a single thread, so we can take advantage of its event loop but cannot use multiple CPUs/threads. What we've seen works fine because the thread is idle most of the time - network requests do not consume CPU time.
To build this further, we need to use some storage (database) or distributed queue system. Right now, we rely on variables, which are not shared between threads in Node. It is not overly complicated, but we covered enough ground in this blog post.
- Understand the basics of website parsing and crawling.
- Separate responsibilities and use abstractions when necessary.
- Apply the required techniques to avoid blocks.
- Be able to figure out the following steps to scale up.