How to do Web Scraping with Puppeteer and NodeJS

By Ander · September 23, 2022 · 13 min read · Twitter
Ander is a web developer who has worked at startups for 12+ years. He began scraping social media even before influencers were a thing. Geek to the core.

Web scraping and crawling is the process of automatically extracting large amounts of data from the web. Data extraction is on the rise, but most websites don't provide data via API. Follow this tutorial to learn how to use Puppeteer for web scraping in NodeJS and extract that information.

Headless browsers are thriving since antibot systems are common and available for anyone. Bypassing defensive software with static scraping solutions such as Axios is close to impossible. And here is where web scraping with Puppeteer enters.

The other principal upside is extracting content from websites rendered with JavaScript, called dynamic scraping.

As you might know, Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. It allows us to browse the Internet with a headless browser programmatically. Designed for testing, we'll see how to do web scraping with Puppeteer.

Puppeteer allows you to do almost anything that you can do manually in a browser. Visit pages, click links, submit forms, take screenshots, and many more.

Can Puppeteer be used for web scraping?

Of course, web scraping is the second most common use for Puppeteer! 😹

Applying those features plus the ability to extract content, we start to see how to build a crawler and parser. We can tell Puppeteer to visit our target page, select some elements, and extract data from them. Then, parse all the links on the page and add them to the crawl queue.

Does it sound like a web scraper? Let's dive in!

What are the advantages of using Puppeteer for web scraping?

Axios and Cheerio are an excellent option for web scraping with Javascript. That approach has two problems: scraping dynamic content and anti-scraping software. We'll see later how to avoid and bypass antibot systems.

Since Puppeteer is a headless browser, it has no problem with dynamic content. It will load the target page and run the Javascript present on it, maybe triggering XHR requests to get extra content. You wouldn't be able to extract that with a static scraper. The same goes for single-page applications (SPA), where the initial HTML is almost empty of data.

It will also render images and allow you to take screenshots. You could program your script to go to a certain page and take a screenshot every day at the same hour. Then analyze them to gain a competitive advantage. The options are endless!

Prerequisites

For the code to work, you'll need Node (or nvm) and npm installed. Some systems have them pre-installed. Start a project with npm init -y, which creates a package.json file, and then install the only dependency we need:

npm install puppeteer

The code runs in Node v16, but you can always check the compatibility of each feature.

How do you use Puppeteer to scrape a website?

After installing Puppeteer, you are ready to start scraping! Open your favorite editor, create a new file - index.js - and add the following code:

const puppeteer = require('puppeteer'); 
 
(async () => { 
	// Initiate the browser 
	const browser = await puppeteer.launch(); 
	 
	// Create a new page with the default browser context 
	const page = await browser.newPage(); 
 
	// Go to the target website 
	await page.goto('https://example.com'); 
 
	// Get pages HTML content 
	const content = await page.content(); 
	console.log(content); 
 
	// Closes the browser and all of its pages 
	await browser.close(); 
})();

You can run the script on NodeJS with node index.js. It will print the HTML content of the example page, containing <title>Example Domain</title>, for example.

You already scraped your first page. Nice!

Frustrated that your web scrapers are blocked once and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Selecting nodes with Puppeteer

Printing the whole page might not be a good solution for most use cases. We'd better select parts of the page and access their content or attributes.

Example.org Annotated
As we see above, we can extract relevant data from the highlighted nodes. We will use CSS Selectors for that. Puppeteer's Page API exposes methods to access the page just as a browsing user would.
  • $(selector) is like document.querySelector, will find an element.
  • $$(selector) executes document.querySelectorAll, locates all the matching nodes.
  • $x(expression) evaluates the XPath expression, which is useful to find nodes by their text content.
  • evaluate(pageFunction, args) will execute any Javascript instructions on the browser and return the result.

And many more. It gives us all the flexibility we need.

We will now get the mentioned nodes. An h1 containing the title and the a tag's href attribute for the link.

await page.goto('https://example.com'); 
 
// Get the node and extract the text 
const titleNode = await page.$('h1'); 
const title = await page.evaluate(el => el.innerText, titleNode); 
 
// We can do both actions with one command 
// In this case, extract the href attribute instead of the text 
const link = await page.$eval('a', anchor => anchor.getAttribute('href')); 
 
console.log({ title, link });

The output will be this one:

{ 
	title: 'Example Domain', 
	link: 'https://www.iana.org/domains/example' 
}

As we see in the snippet, there are several ways to achieve the result. Puppeteer exposes several functions that will allow you to customize the data extraction.

All the info was already present on the first load, but how can we get dynamic content?

Waiting for content to load or appear

How do we scrape data that isn't present on the first load? It might live in a script tag (React, for example, embeds JSON objects) or arrive after an XHR request to the server. Puppeteer allows us to wait for the content: for network status or for elements to become visible. Here we mention some options, but, again, there are more. Check the Page API documentation for more info.
  • waitForNetworkIdle stops the script until the network is idle.
  • waitForSelector pauses until a node that matches selector is present.
  • waitForNavigation waits for the browser to navigate to a new URL.
  • waitForTimeout sleeps a number of milliseconds but is now obsolete and not recommended.
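Since waitForTimeout is deprecated, a fixed pause can be replaced with a plain promise-based delay. A minimal sketch (the sleep helper name is ours):

```javascript
// Promise-based delay: resolves after the given number of milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Usage inside any async Puppeteer script:
// await page.click(selector);
// await sleep(1000); // give the UI a moment to settle
```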

We'll now switch the target website and go to a YouTube video. The recommended videos, for example, are loaded asynchronously after an XHR request. That means we must wait for that content to be present.

YouTube recommended videos

Below you can see how to do it using waitForSelector.

(async () => { 
	const browser = await puppeteer.launch(); 
	const page = await browser.newPage(); 
	await page.goto('https://www.youtube.com/watch?v=tmNXKqeUtJM'); 
 
	const videosTitleSelector = '#items h3 #video-title'; 
	await page.waitForSelector(videosTitleSelector); 
	const titles = await page.$$eval( 
		videosTitleSelector, 
		titles => titles.map(title => title.innerText) 
	); 
	console.log(titles); 
 
	// [ 
	//	 'Why Black Holes Could Delete The Universe – The Information Paradox', 
	//	 'Why All The Planets Are On The Same Orbital Plane', 
	//	 ... 
	// ] 
 
	await browser.close(); 
})();

We can also use waitForNetworkIdle. The output will be similar, but the mechanism behind it is different: it pauses the execution until the network goes idle. That might not work on pages that are full of items, slow to load, or that stream content, among others.

waitForSelector, in contrast, will only check that at least one video title is present. When a node like that appears, it will return and resume the execution.
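Conceptually, waitForSelector boils down to a polling loop: check a condition, sleep briefly, repeat until it holds or a timeout expires. Here's a simplified plain-JavaScript sketch of that idea (not Puppeteer's actual implementation):

```javascript
// Poll an async predicate until it returns true or the timeout expires.
async function waitFor(predicate, { timeout = 30000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error(`waitFor: condition not met within ${timeout} ms`);
}

// e.g. await waitFor(async () => (await page.$(selector)) !== null);
```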

Now that we can scrape data, how do we crawl new pages?

How do you scrape multiple pages using Puppeteer?

We'll need links first! We can't build a web crawler with only our seed URL.

Let's assume that we are only interested in related videos. The ones shown in the section above. But now, we should scrape the links instead of the titles:

await page.waitForSelector('#items h3 #video-title'); 
 
const videoLinks = await page.$$eval( 
	'#items .details a', 
	links => links.map(link => link.href) 
); 
console.log(videoLinks); 
 
// [ 
//	'https://www.youtube.com/watch?v=ceFl7NlpykQ', 
//	'https://www.youtube.com/watch?v=d3zTfXvYZ9s', 
//	'https://www.youtube.com/watch?v=DxQK1WDYI_k', 
//	'https://www.youtube.com/watch?v=jSMZoLjB9JE', 
// ]

Great! You now went from a single page to 20. The next step is to navigate to those and continue extracting data and more links.

We already covered the topic of web crawling in Javascript, and we won't go into further detail here.

Long story short: visit each page, scrape the data and add the links to a list. Then repeat. It's a bit more complicated, but you get the point.

Alright! A quick recap of what we've seen, the basics:
  1. Install and run Puppeteer.
  2. Scrape data using selectors.
  3. Extract links from the HTML.
  4. Crawl the new links.
  5. Repeat from #2.
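The recap above can be sketched as a small queue-based crawl loop. Here, scrapePage is a placeholder for the Puppeteer logic we've already seen (visit the URL, return the extracted data and the discovered links):

```javascript
// Generic breadth-first crawl loop: scrapePage(url) -> { data, links }.
async function crawl(seedUrl, scrapePage, maxPages = 20) {
  const queue = [seedUrl];
  const visited = new Set();
  const results = [];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    const { data, links } = await scrapePage(url);
    results.push({ url, data });

    // Enqueue newly discovered links for later visits
    for (const link of links) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return results;
}
```

Plug in a scrapePage that opens a Puppeteer page, extracts what you need, and returns the related links, and you have the skeleton of a crawler.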

Additional Puppeteer features

Now that we've covered the fundamentals of web scraping with Puppeteer, let's take a look at other features.

How do you take a screenshot using Puppeteer?

It'd be nice if you could see what the scraper is doing, right?

You can always run on headful mode with puppeteer.launch({ headless: false });. But that's not a real solution for large-scale scraping.

Luckily, Puppeteer also offers the feature to take screenshots. That's useful for debugging but also, for example, for creating snapshots of your target pages.

await page.screenshot({ path: 'youtube.png', fullPage: true });

As easy as that! The fullPage parameter defaults to false, so you have to pass it as true to get the whole page's content.

We'll see later how this will be helpful for us.

Execute Javascript with Puppeteer

As you already know, Puppeteer offers multiple features to interact with the target site and browser. But assume that you want to do something different. Something that you can run with Javascript directly on the browser. How can you do that?

Puppeteer's evaluate function will take any Javascript function or expression and execute it for you. With evaluate, you can virtually do anything on the page. Add or remove DOM nodes. Modify styles. Check for items on localStorage and expose them on a node. Read and modify cookies.

const storageItem = await page.evaluate("JSON.parse(localStorage.getItem('ytidb::LAST_RESULT_ENTRY_KEY'))"); 
console.log(storageItem); 
 
// { 
//	data: { hasSucceededOnce: true }, 
//	expiration: 1665752343778, 
//	creation: 1663160343778 
// }

We opted for the localStorage example. You can access a key that YouTube generates and return it with evaluate. As said before, you can control almost anything that happens on the browser. Even things that aren't visible to users.

Take into account that we aren't adding exception control or any defensive measures, for brevity. In the case above, if the key doesn't exist, evaluate returns null. But other issues might throw and crash your scraper.
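A sketch of that kind of defensive measure: run an async action, retry a couple of times, and fall back to a default value instead of crashing. The helper name is ours:

```javascript
// Run an async action; retry on failure and fall back to a default value.
async function tryOrDefault(action, { retries = 2, fallback = null } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await action();
    } catch (err) {
      if (attempt === retries) {
        console.error(`All ${retries + 1} attempts failed: ${err.message}`);
        return fallback;
      }
    }
  }
}

// e.g. const item = await tryOrDefault(
//   () => page.evaluate("JSON.parse(localStorage.getItem('some-key'))"),
//   { fallback: {} }
// );
```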

Submit a form with Puppeteer

Another typical action when browsing is submitting forms. You can replicate that behavior when web scraping. A use case would be to use Puppeteer to login to a website.

Following our example, we will fill in the search form and submit it. For that, we need to click a button and type text. Our entry point will be the same. Then, the search form will take us to a different page.

Remember that Puppeteer will browse the web without cookies - unless you tell it otherwise. In YouTube's case, that means seeing the cookie banner on top of the page. And you can't interact with the page until you accept or reject them. So we must remove that. Otherwise, the search won't work.

YouTube cookie dialog

We have to locate the button, wait for it to be present, click it, and wait for a second. The last step is necessary for the dialog to disappear.

According to the click's documentation, we should click and wait for navigation using Promise.all. It didn't work in our case, so we opted for the alternative way.

Don't worry about the long CSS selector. There are several buttons, and we have to be specific. Besides, YouTube uses custom HTML elements such as ytd-button-renderer.

const cookieConsentSelector = 'tp-yt-paper-dialog .eom-button-row:first-child ytd-button-renderer:first-child'; 
 
await page.waitForSelector(cookieConsentSelector); 
await page.click(cookieConsentSelector); 
await page.waitForTimeout(1000);

The next step is to fill in the form. In this case, we'll use two Puppeteer functions: type to enter the query and press to submit the form by hitting Enter. We could also click the button.

As you can see, we are coding the instructions that a user would perform directly on the browser.

const searchInputEl = await page.$('#search-form input'); 
await searchInputEl.type('top 10 songs'); 
await searchInputEl.press('Enter');

Lastly, wait for the search page to load, and take a screenshot. We've already seen how to do that with Puppeteer.

await page.waitForSelector('ytd-two-column-search-results-renderer ytd-video-renderer'); 
await page.screenshot({ path: 'youtube_search.png', fullPage: true });
YouTube search page

And there you have it!

Block or intercept requests in Puppeteer

As you can see on the screenshot, the scraper is loading the images. Which is good for debugging purposes. But not for a large-scale crawling project.

Web scrapers should optimize resources and increase the crawling speed when possible. And not loading pictures is an easy one. To that end, we can take advantage of Puppeteer's support for resource blocking or intercepting requests.

By calling page.setRequestInterception(true), Puppeteer will enable you to check requests and abort them based on type, for example. It's crucial to run this part before visiting the page.

await page.setRequestInterception(true); 
 
// Block requests whose URL ends with or contains .png or .jpg 
page.on('request', interceptedRequest => { 
	if ( 
		interceptedRequest.url().endsWith('.png') || 
		interceptedRequest.url().endsWith('.jpg') || 
		interceptedRequest.url().includes('.png?') || 
		interceptedRequest.url().includes('.jpg?') 
	) { 
		interceptedRequest.abort(); 
	} else { 
		interceptedRequest.continue(); 
	} 
}); 
 
// Go to the target website 
await page.goto('https://www.youtube.com/watch?v=tmNXKqeUtJM');

Not the most elegant solution - for now - but it gets the job done.
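One way to make it a bit more elegant: parse the URL and check the pathname's extension, ignoring query strings. A small helper sketch:

```javascript
const BLOCKED_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.gif', '.webp', '.svg'];

// Decide whether a request URL points at an image, ignoring query strings.
function isImageUrl(rawUrl) {
  try {
    const { pathname } = new URL(rawUrl);
    return BLOCKED_EXTENSIONS.some(ext => pathname.toLowerCase().endsWith(ext));
  } catch (e) {
    return false; // not a valid absolute URL; let it through
  }
}

// page.on('request', req =>
//   isImageUrl(req.url()) ? req.abort() : req.continue());
```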

YouTube without images

Each of the intercepted requests is an HTTPRequest. Apart from the URL, you can access the resource type. It makes it simpler for us to block all images.

// list the resources we don't want to load 
const excludedResourceTypes = ['stylesheet', 'image', 'font', 'media', 'other', 'xhr', 'manifest']; 
page.on('request', interceptedRequest => { 
	// block resources based in their type 
	if (excludedResourceTypes.includes(interceptedRequest.resourceType())) { 
		interceptedRequest.abort(); 
	} else { 
		interceptedRequest.continue(); 
	} 
});

These are the two main ways of blocking requests with Puppeteer: inspect each URL in detail, or take the more general approach and block by resource type.

We won't go into details, but there is a plugin to block resources. And a more specific one that implements adblocker.

By blocking these resources, you might be saving 80% on bandwidth! It comes as no surprise that most of the content on the Internet nowadays is image or video based. And those weigh way more than plain text.

On top of that, less traffic means faster scraping. And if you are employing metered proxies, cheaper scraping too.

Speaking of proxies, how can we use them in Puppeteer?

Avoiding bot detection

It comes as no surprise that antibot software is more common every day. Almost any website can run a defensive solution thanks to their easy integration. If you stay with us, you will learn how to bypass antibot solutions using Puppeteer: Cloudflare or Akamai, for example.

As you might have guessed, the first and most common way to avoid detection is using proxies.

Using proxies with Puppeteer

Proxies are servers that act as intermediaries between your connection and your target site. You will send your requests to the proxy, and it will then relay them to the final server.

Why would we need an intermediary? As you might have guessed, it will be slower. But more effective for web scraping.

Probably the easiest way to ban a scraper is by its IP. Millions of requests from the same IP in just a day? It's a no-brainer, and any defensive system will block those connections.

But thanks to proxies, you can have different IPs. Rotating proxies can assign a new IP per request, making it more difficult for antibots to ban your scraper.
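A minimal sketch of that rotation idea: cycle through a pool of proxy addresses so each browser launch uses a different one. The addresses below are placeholders:

```javascript
// Round-robin rotator over a pool of proxy addresses.
function makeProxyRotator(proxies) {
  let index = 0;
  return () => {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

// const nextProxy = makeProxyRotator(['1.2.3.4:3128', '5.6.7.8:3128']);
// const browser = await puppeteer.launch({
//   args: [`--proxy-server=${nextProxy()}`],
// });
```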

We will use a free proxy for the demo, but we don't recommend them. They might work for testing but aren't reliable. Note that the one below might not work for you; free proxies are short-lived.

(async () => { 
	const browser = await puppeteer.launch({ 
		// pass the proxy to the browser 
		args: ['--proxy-server=23.26.236.11:3128'], 
	}); 
	const page = await browser.newPage(); 
 
	// example page that will print the calling IP address 
	await page.goto('https://www.httpbin.org/ip'); 
 
	const ip = await page.$eval('pre', node => node.innerText); 
	console.log(ip); 
	// { 
	//	"origin": "23.26.236.11" 
	// } 
 
	await browser.close(); 
})();

Puppeteer accepts a set of arguments that Chromium will set on launch. You can check their documentation on network settings for more info.

This implementation will send all the scraper's requests using the same proxy. That might not work for you. As mentioned above, unless your proxy rotates the IPs, the target server will see the same IP once and again. And ban it.

Fortunately, there are Node JS libraries that help us rotate proxies. puppeteer-page-proxy supports both HTTP and HTTPS proxies, authentication, and changing the used proxy per page. Or even per request, thanks to request interception (as we saw earlier).

Avoid geoblocking with premium proxies

Some antibot vendors, like Cloudflare, allow clients to customize the challenge level by location.

Let's take, for example, a store based in France. It might sell a small percentage in the rest of Europe but doesn't ship to the rest of the world.

In that case, it would make sense to have different levels of strictness. Lower security in Europe since it's more common to browse the site there. Higher challenge options when accessing from outside.

The solution is the same as the section above: proxies. In this case, they must allow geolocation. Premium or residential proxies usually offer this feature. You would get different URLs for each country you want, and those will only use IPs from the selected country. They might look like this: "http://my-user--country-FR:[email protected]:1234".

Setting HTTP headers in Puppeteer

Puppeteer sends, by default, HeadlessChrome as its user agent. No need for the latest tech to realize that it might be web scraping software.

Again, there are several ways to set HTTP headers in Puppeteer. One of the most common is using setExtraHTTPHeaders. You have to execute all header-related functions before visiting the page. Like this, it will have all the required data set before accessing any external site.

But be careful with this one if you use it to set the user agent.

const page = await browser.newPage(); 
 
// set headers 
await page.setExtraHTTPHeaders({ 
	'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36', 
	'custom-header': '1', 
}); 
 
// example page that will print the sent headers 
await page.goto('https://www.httpbin.org/headers'); 
 
const pageContent = await page.$eval('pre', node => JSON.parse(node.innerText)); 
const userAgent = await page.evaluate(() => navigator.userAgent); 
console.log({ headers: pageContent.headers, userAgent }); 
 
// { 
//	 headers: { 
//		'Custom-Header': '1', 
//		'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36', 
//		... 
//	 }, 
//	 userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4950.0 Safari/537.36' 
// }

Can you spot the difference? Antibot systems sure can.

We sent the headers, no problem there. But the navigator.userAgent property exposed in the browser is still the default one: an easy check for antibots. We will inspect another two properties in the following snippets (appVersion and platform) to verify that everything matches.

Let's try the next way to change the user agent: via arguments on browser creation.

const browser = await puppeteer.launch({ 
	args: ['--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'], 
}); 
 
const appVersion = await page.evaluate('navigator.appVersion'); 
const platform = await page.evaluate('navigator.platform'); 
 
// { 
//	userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36', 
//	appVersion: '5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36', 
//	platform: 'Linux x86_64', 
// }

Oops, we've got a problem here: platform still gives us away. Now we'll try the third option: setUserAgent.

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'); 
 
// { 
//	userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36', 
//	appVersion: '5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36', 
//	platform: 'Linux x86_64', 
// }

Same result as before. At least this option is easier to handle, and we could change the user agent per request. We can combine this with the user-agents NodeJS package.
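A sketch of per-page rotation with a hardcoded pool (the user-agents package can generate realistic strings for you; this pool is just an example):

```javascript
// A tiny pool of example user-agent strings.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
];

// Pick a random user agent from the pool.
const randomUserAgent = () =>
  USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

// await page.setUserAgent(randomUserAgent());
```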

There must be a solution, right? And once again, Puppeteer comes with evaluateOnNewDocument, which can alter the navigator object before visiting the page. It means that the target page will see what we want to show.

For that, we have to overwrite the platform property. The function below will return a hardcoded string when Javascript accesses the value.

await page.evaluateOnNewDocument(() => 
	Object.defineProperty(navigator, 'platform', { 
		get: function () { 
			return 'Win32'; 
		}, 
	}) 
); 
 
// { 
//	userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36', 
//	platform: 'Win32', 
// }

Great! We can now set custom headers and the user agent. Plus, modify properties that don't match.
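To keep the spoofed values consistent with each other, one option is to derive the platform from the user-agent string you send. A hypothetical helper:

```javascript
// Map a user-agent string to the navigator.platform value it implies.
function platformForUserAgent(userAgent) {
  if (userAgent.includes('Windows')) return 'Win32';
  if (userAgent.includes('Macintosh')) return 'MacIntel';
  if (userAgent.includes('Linux')) return 'Linux x86_64';
  return 'Win32'; // a reasonable default for a spoofed desktop browser
}

// await page.evaluateOnNewDocument(platform => {
//   Object.defineProperty(navigator, 'platform', { get: () => platform });
// }, platformForUserAgent(myUserAgent));
```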

We've simplified this detection part a little bit. To avoid these problems, it's common to add the stealth plugin for Puppeteer (puppeteer-extra-plugin-stealth), which applies these same techniques to avoid detection. You can browse its code; it's open source.

In case you are interested, we wrote a guide on how to avoid bot detection. It uses Python for the code examples, but the principles are the same.

You did it! You can start scraping now with Puppeteer and extract all the data you need.

Conclusion

In this Puppeteer tutorial, you've learned from installation to advanced topics. Don't forget the two main reasons to favor web scraping with Puppeteer or other headless browsers: extracting dynamic data and bypassing antibot systems.

We'd like you to go with the 5 main points clear:
  1. In which cases and why to use Puppeteer for web scraping.
  2. Install and apply the basics to start extracting data.
  3. CSS Selectors to get the data you are after.
  4. If you can do it manually, Puppeteer might have a feature for you.
  5. Avoiding bot detection with good proxies and HTTP headers.

Nobody said that using Puppeteer for web scraping was easy, but you're closer to accessing any content you want!

We left many things out, so check the official documentation when you need additional features. And remember that we didn't cover web crawling either. You need to go from one page to thousands and scale your scraping system.

Thanks for reading! We hope that you found this guide helpful. You can sign up for free, try ZenRows, and let us know any questions, comments, or suggestions.

Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn, or Facebook.


Want to keep learning?

We will be sharing all the insights we have learned through the years in the following blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.