Extracting (scraping) webpage content with Puppeteer

I’m working on a personal project to extract some content from a number of websites. I started out trying to use one site’s REST API, but it turned out to be far more complicated, with far more steps than I was prepared to invest the time in, so I realized I could just scrape the HTML content instead to get what I was looking for.

There are many HTML scraping and parsing libraries. I took a look at x-ray and a few others, but what I discovered is that many sites detect that you’re not browsing in a real browser (presumably by checking your user agent, cookies and so on) and then force you to complete CAPTCHAs to prove that you’re not a robot.

I then stumbled across Apify, and more specifically Puppeteer. Apify does far more than I need for this little project, but since it can use Puppeteer under the covers, I found that using Puppeteer directly does all that I need. Puppeteer drives a real (headless) Chrome browser, so pages load much as they would for a normal visitor.

Here’s a small script for extracting some text from an example site:

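const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://www.example-website.com');
    // Run the callback in the page against the first element matching the selector
    const extractedValue = await page.$eval('#element-id', el => el.innerHTML);
    console.log('extracted value: ' + extractedValue);
  } finally {
    // Always close the browser, even if navigation or extraction fails
    await browser.close();
  }
})();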
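
The script above pulls a single element. When a page has several matching elements, or renders them after the initial load, a variation along these lines may help. This is only a sketch: the helper names extractAll and scrape and the '.item-title' selector are made up for illustration, while waitForSelector and $$eval are regular Puppeteer page methods.

```javascript
// Hypothetical helper: collect the trimmed text of every element matching
// a selector. `page` is expected to be a Puppeteer Page object.
async function extractAll(page, selector) {
  // Wait until at least one matching element exists in the DOM
  await page.waitForSelector(selector);
  // $$eval runs the callback in the page with an array of all matches
  return page.$$eval(selector, els => els.map(el => el.textContent.trim()));
}

// Hypothetical wrapper: open a browser, visit the URL, extract, clean up.
async function scrape(url, selector) {
  const puppeteer = require('puppeteer'); // loaded only when scrape() is called
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // 'networkidle2' waits until the page has mostly stopped making requests
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await extractAll(page, selector);
  } finally {
    // Close the browser even if navigation or extraction throws
    await browser.close();
  }
}
```

It could be called as scrape('https://www.example-website.com', '.item-title').then(console.log). Using textContent rather than innerHTML returns plain text without markup, which may or may not be what you want.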

Dependency management with npm

A few rough usage notes:

  • npm install module : downloads and installs module into node_modules. Since npm 5 this also records the dependency in package.json by default
  • npm install module --save : saves module info in package.json (only needed on npm versions before 5, where it was not the default)
  • npm install module -g : downloads and installs module globally, not just in the current dir/project, so its command-line tools can be used from any project
  • npm init : creates a new package.json from answers to a few questions, plus any existing downloaded modules in node_modules (useful to recreate package.json if you didn’t install modules with --save initially)
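
To check whether a project has modules that were installed without being saved, a small Node script can compare node_modules against package.json. This is a rough sketch; findUndeclared is a made-up helper name, and because npm flattens transitive dependencies into node_modules, the result will usually include those as well as anything you installed by hand.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: list top-level packages present in node_modules that
// package.json does not declare under dependencies or devDependencies.
function findUndeclared(projectDir) {
  const pkgPath = path.join(projectDir, 'package.json');
  const pkg = fs.existsSync(pkgPath)
    ? JSON.parse(fs.readFileSync(pkgPath, 'utf8'))
    : {};
  const declared = new Set([
    ...Object.keys(pkg.dependencies || {}),
    ...Object.keys(pkg.devDependencies || {}),
  ]);

  const modulesDir = path.join(projectDir, 'node_modules');
  if (!fs.existsSync(modulesDir)) return [];

  const installed = [];
  for (const entry of fs.readdirSync(modulesDir)) {
    if (entry.startsWith('.')) continue; // skip .bin and other dot entries
    if (entry.startsWith('@')) {
      // Scoped packages live one level down: node_modules/@scope/name
      for (const sub of fs.readdirSync(path.join(modulesDir, entry))) {
        installed.push(`${entry}/${sub}`);
      }
    } else {
      installed.push(entry);
    }
  }
  // Keep only packages that package.json does not mention
  return installed.filter(name => !declared.has(name)).sort();
}
```

Calling console.log(findUndeclared(process.cwd())) from the project root prints the undeclared package names.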