Web Scraping with Puppeteer and Node.js

Mon Mar 27 2023

Web scraping is a powerful technique for extracting data from websites, and Puppeteer is a popular Node.js library for automating web scraping tasks.

In this article, we will explore a Puppeteer script that scrapes book data from a website using Node.js. This article is intended for those who are new to web scraping or those who want to learn how to use Puppeteer for web scraping tasks. By the end of this article, you will have a better understanding of how to extract data from websites using Puppeteer and Node.js.

Let's start coding. Open a terminal, navigate to the desired directory, and type

npm init -y

This will create a package.json file. Open the directory in your code editor and create an index.js file at the root level; this will be our entry file. Next we will install the puppeteer package. In the terminal, type

npm install puppeteer

Navigate to Books to Scrape to see the structure of the website we are going to scrape. There are 20 books shown on a single page. We will scrape the title, image URL, and price of every book.

In the index.js file, copy and paste the following code

const puppeteer = require("puppeteer");
const fs = require("fs");

const URL = "https://books.toscrape.com/";

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(URL);

  await browser.close();
})();

At the top we imported the puppeteer and fs modules. The fs module enables interacting with the file system; we will use it later to save our data to a file. After that, we created a constant URL and assigned our target URL to it. We then created an anonymous async function and invoked it immediately with (), a pattern known as an IIFE (immediately invoked function expression).
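
As a side note, the IIFE is only needed because CommonJS modules can't use await at the top level. For reference, a sketch of the same setup as an ES module (rename the file to index.mjs, or add "type": "module" to package.json) can use top-level await and skip the wrapper entirely:

import puppeteer from "puppeteer";

const URL = "https://books.toscrape.com/";

// Top-level await works in ES modules, so no wrapper function is needed.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto(URL);

await browser.close();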

Within the function we launched Puppeteer and passed headless as false. This lets us watch what's going on when we run index.js. If you don't want a visible browser window, simply remove the headless property; it defaults to true. Next we called the newPage method on browser and stored the result in the page constant. We then called the goto method on page and passed the target URL. Finally, we closed the browser by calling browser.close(). All of the code that grabs data will go before browser.close(). Everything we have written so far returns a promise, which is why we added the await keyword.
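
Incidentally, puppeteer.launch accepts other options that are handy while developing a scraper. A sketch with two of them: slowMo slows every Puppeteer operation down by the given number of milliseconds so you can follow along, and defaultViewport: null makes the page use the browser window's actual size:

const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 50, // slow each operation down by 50 ms
  defaultViewport: null, // use the window's real size
});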

Save the file and in terminal type

node index.js

This will open a new browser window, load https://books.toscrape.com/, and then close immediately because of the await browser.close() we added to our code. Before browser.close(), add

await page.waitForTimeout(3000);

Run index.js again and this time it will pause for 3 seconds after the page loads, then close the browser. That's page.waitForTimeout in action. You can comment this line out; it was just to show that you can add a delay.
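
Note that page.waitForTimeout is deprecated and has been removed in recent Puppeteer versions. If it isn't available in your version, a plain Promise-based delay does the same thing:

// Sleep for 3 seconds without relying on Puppeteer's API.
await new Promise((resolve) => setTimeout(resolve, 3000));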

If you inspect the books in the browser you will see that every book sits inside an <article> tag with the class product_pod, and that's exactly what we will target to grab a reference to each book. For the title we will grab the <a> tag within the <h3>, and so on. Let's write the code for that. After await page.goto(URL); paste the following

  let data = [];

  const books = await page.$$(".product_pod");

  for (let i = 0; i < books.length; i++) {
    const book = books[i];
    const title = await book.$eval("h3 > a", (el) => el.textContent);
    const img = await book.$eval("img", (el) => el.src);
    const price = await book.$eval(".price_color", (el) => el.textContent);

    data.push({
      title,
      img,
      price,
    });
  }

Here we initialized an empty array and stored it in the data variable. Next we grabbed references to all the elements with the class product_pod and stored them in books. page.$$(".product_pod") is similar to document.querySelectorAll(".product_pod"); $$ is used when you need to extract information from multiple elements that match a given selector. Then we created a for-loop to iterate over the books array, and for every book we target the title, image src, and price. $eval takes a selector and a function: it finds the first element matching the selector, runs the function on it, and returns the result. For example, we get the title by providing the selector h3 > a, targeting the <a> tag within <h3>, and reading its textContent. (Note that this site truncates long titles in the link text; the full title lives in the element's title attribute.) The same goes for the image and the price, except that for the image we read the src field. On each iteration we push the result into the data array.
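
As an alternative, Puppeteer also provides page.$$eval, which runs a single function over all matching elements inside the page and returns the result. A sketch of the same extraction in one call:

// Collect title, image URL and price for every .product_pod at once.
const data = await page.$$eval(".product_pod", (elements) =>
  elements.map((el) => ({
    title: el.querySelector("h3 > a").textContent,
    img: el.querySelector("img").src,
    price: el.querySelector(".price_color").textContent,
  }))
);

This trades the per-book $eval round trips for a single evaluation in the browser, which is usually faster.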

After browser.close(), add

fs.writeFileSync("books.json", JSON.stringify(data));

Here we are writing the data to a books.json file. Run the index.js file again and you will see a books.json file created at the root level of your project, containing all the book records from https://books.toscrape.com. (If you want the file to be human-readable, JSON.stringify(data, null, 2) writes pretty-printed JSON.) This scrapes the records for a single page. If you look at the bottom of books.toscrape.com you will see a next button. To get the books from the next page we have to follow that link and then repeat the for-loop. Let's modify our code to grab records from multiple pages.

  let pageCount = 0;
  let data = [];

  while (pageCount < 2) {
    const books = await page.$$(".product_pod");

    for (let i = 0; i < books.length; i++) {
      const book = books[i];
      const title = await book.$eval("h3 > a", (el) => el.textContent);
      const img = await book.$eval("img", (el) => el.src);
      const price = await book.$eval(".price_color", (el) => el.textContent);

      data.push({
        title,
        img,
        price,
      });
    }

    try {
      const nextPage = await page.$eval(".next > a", (el) => el.href);
      await page.goto(nextPage);
    } catch (err) {
      break;
    }
    pageCount++;
  }

We added a new variable pageCount to track the pages, then wrapped the scraping code in a while-loop that runs as long as pageCount is less than 2; in this example we only want the data for two pages. After collecting the data for the current page, we check whether a next button exists. If it does, we grab its href attribute, which is the link to the next page, and navigate to it with page.goto, which itself waits for the new page to load before resolving. If there is no next button, page.$eval throws, the catch block runs, and we break out of the loop. Finally, we incremented pageCount.
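
An equivalent approach is to click the next button instead of navigating to its href. In that case you do need waitForNavigation, because page.click does not wait for the resulting page load on its own:

// Click the next link and wait until the new page has finished loading.
await Promise.all([
  page.waitForNavigation(),
  page.click(".next > a"),
]);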

Run the index.js file again. If headless is still set to false, you will see that after grabbing the books from the first page the browser navigates to the second page and repeats the scraping process.

Here is the complete code

const puppeteer = require("puppeteer");
const fs = require("fs");

const URL = "https://books.toscrape.com/";

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(URL);

  let pageCount = 0;
  let data = [];

  // Scrape until we have collected the first two pages.
  while (pageCount < 2) {
    const books = await page.$$(".product_pod");

    for (let i = 0; i < books.length; i++) {
      const book = books[i];
      const title = await book.$eval("h3 > a", (el) => el.textContent);
      const img = await book.$eval("img", (el) => el.src);
      const price = await book.$eval(".price_color", (el) => el.textContent);

      data.push({
        title,
        img,
        price,
      });
    }

    try {
      // Grab the href of the next-page link; this throws if no next button exists.
      const nextPage = await page.$eval(".next > a", (el) => el.href);
      await page.goto(nextPage);
    } catch (err) {
      // No next button, so we have reached the last page we can scrape.
      break;
    }
    pageCount++;
  }

  await browser.close();

  fs.writeFileSync("books.json", JSON.stringify(data));
})();
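
One refinement worth making before building on this script: if any step throws, browser.close() is never reached and the browser process keeps running. Wrapping the work in try/finally (a sketch using the same structure as above) guarantees cleanup:

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(URL);
    // ... scraping logic goes here ...
  } finally {
    // Runs even if scraping throws, so the browser never leaks.
    await browser.close();
  }
})();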

This code provides a simple example of how to scrape data from a website using Puppeteer. It demonstrates how to navigate to a website, select elements on the page, and extract data from those elements. This script can be modified to scrape data from other websites and can be used as a starting point for more complex web scraping tasks.
