Extracting (scraping) webpage content with Puppeteer

I’m working on a personal project to extract some content from a number of websites. This idea started out trying to use a REST api to a site but it turned out to be far more complicated involving far more steps that I was prepare to invest the time in to get working, so I realized I could just scrape their HTML content instead to get what I was looking for.

There are many HTML scraping and parsing libraries. I took a look at x-ray and a few others, but what I discovered is that many sites detect the fact that you’re not browsing the site in a real browser (presumably from checking your user-agent and cookies etc) and then force you into completing CAPTCHAs etc to prove that you’re not a robot.

I then stumbled across Apify, and more specifically Puppeteer. I think Apify does far more than I need for this little project. Instead, since Apify can also use Puppeteer under the covers, I found that just using Puppeteer directly does all that I need.

Here’s a small script for extracting some text from an example site:

const puppeteer = require('puppeteer');

(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.example-website.com');
const extractedValue = await page.$eval('#element-id', el => el.innerHTML);
console.log('extracted value: ' + extractedValue);

await browser.close();

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.