Extracting (scraping) webpage content with Puppeteer

I’m working on a personal project to extract some content from a number of websites. This idea started out trying to use a REST api to a site but it turned out to be far more complicated involving far more steps that I was prepare to invest the time in to get working, so I realized I could just scrape their HTML content instead to get what I was looking for.

There are many HTML scraping and parsing libraries. I took a look at x-ray and a few others, but what I discovered is that many sites detect the fact that you’re not browsing the site in a real browser (presumably from checking your user-agent and cookies etc) and then force you into completing CAPTCHAs etc to prove that you’re not a robot.

I then stumbled across Apify, and more specifically Puppeteer. I think Apify does far more than I need for this little project. Instead, since Apify can also use Puppeteer under the covers, I found that just using Puppeteer directly does all that I need.

Here’s a small script for extracting some text from an example site:

const puppeteer = require('puppeteer');

(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.example-website.com');
const extractedValue = await page.$eval('#element-id', el => el.innerHTML);
console.log('extracted value: ' + extractedValue);

await browser.close();
})();

Building a React frontend for my AWS Lambda Sudoku solver

Over the past few months I built an implementation of Donald Knuth’s Algorithm X using Dancing Links in Java to solve Sudoku puzzles.

This was a fascinating exercise in itself (you can read more my experience here), but the next logical step would be to package it up in a way to share it online.

Since I’m pursuing my AWS certifications right now, one interesting and low cost approach to host the the Solver implementation is to package it as an AWS Lambda. Sudoku Solver as a Service? Done. I exposed it through AWS API Gateway. It accepts an request payload that looks like this:

{"rows":["...81.67.","..749.2.8",".6..5.1.4","1....39..","4...8...7","..69....3","9.2.3..6.","6.1.743..",".34.69..."]}

and returns a response with a solution to the submitted puzzle request like this:

{"rows":["349812675","517496238","268357194","185723946","493681527","726945813","972538461","651274389","834169752"]}

The request and response payloads are an array of Strings, where each item represents a String of values concatenated together for one row in the grid, with ‘.’s for unknowns.

I’m still learning React as I go, and while building this front end for my Lambda Sudoku Solver I learnt some interesting things about React and Javascript. The source for the app is shared here.

I used Flux to structure the app, so there main parts of the app are:

  • a main, highlevel Container component,
  • a CellComponent that renders each cell in the Sudoku grid,
  • an Action that handles the interaction with the AWS Lambda
  • a Store that holds the results from calling the Lambda

I don’t want to focus on the pros and cons of using React or Flux (and this is not intended to be a how-to on building an app using React) as there were some other specific issues I ran into that were interesting learning opportunities. A couple of these I already captured in separate posts, so I’ll include these links below.

Iteration 1: onChange handler per row

My first approach to maintaining the state for the display of the grid and the handler for changes to each cell was to keep it simple and have a seperate array of values per row, and a separate onChange handler for each row. This is not a particularly effective way to structure this as there’s duplication in each of the 9 handlers.

The State looked like this:

this.state =
{
row1 : [],
row2 : [],
row3 : [],
row4 : [],
row5 : [],
row6 : [],
row7 : [],
row8 : [],
row9 : []
};

And each of the handlers looked like this, one handler per row, so handleChangeRow1() through handleChangeRow9():

handleChangeRow1(index, event){
console.log("row 1 update: " + event.target.value);
var updatedRow = [...this.state.row1];
updatedRow[index] = event.target.value
this.setState( { row1 : updatedRow } );
}

This approach needed 9 versions of the function above, each one specifically handling updates to the state for a single row. We’ll come back to improving this later.

The interesting thing to notice at this point that to update an array in React state, you need to clone a copy of the array, and then update the copy. I used the spread operator ‘…’ to clone the array.

Each row in the grid I rendered separately like this (so this approach needed 9 of these blocks):

<div>
{
this.state.row1.map( (cell, index) => (
<CellComponent key={index} value={this.state.row1[index]}
onChange={this.handleChangeRow1.bind(this, index)}/>
)
)}
</div>

This was my first working version of the app, at least at the point where I could track the State of the grid as a user entered or changed values in the 9×9 grid. Next steps was to improve the approach.

Iteration 2: Using an array of arrays for the State

The first improvement was to improve the State arrays, moving to an array of arrays. This is easily setup like this:

this.state =
{
grid: []
};

for (var row = 0; row < 9; row++) {
this.state.grid[row] = [];
}

Iteration 3: One onChange handler for all rows

Instead of a handler per row, I parameterized the onChange handler to reused for all rows. This is what I ended up with:

handleGridChange(row, colIndex, event) {
console.log("row [" + row + "] col [" + colIndex + "] : " + event.target.value);
var updatedGrid = [...this.state.grid];
updatedGrid[row][colIndex] = event.target.value;

//call Action to send updated data to Store
SudokuSolverAction.updatePuzzleData(updatedGrid);
}

Using .map() on each of the rows in State, I then rendered each row of the grid like this, passing the current row index and column index as params into handleGridChange():

<tr>
{
this.state.grid[0].map((cell, colIndex) => (
<td key={"row0" + colIndex}>
<CellComponent value={this.state.grid[0][colIndex]}
onChange={this.handleGridChange.bind(this, 0, colIndex)}/>
</td>
)
)}
</tr>

I’m sure there’s a way to use a nested .map() of the results of a .map() or some other clever approach to render the whole grid in a single go, but rendering each of rows individual is an ok approach with me since there’s only 9 rows. If the number of rows were much more than 9 then I’d spend some time working on a better approach, but I’m ok with this for now.

Flux Action and Store

The Action to call the Lambda, and maintaining the state of the responses in the Store was pretty simple. You can check out the source here if you’re interested.

CSS styling for the grid

One last thing to do was to style the grid so it looks like a usual Sudoku grid, with vertical and horizontal lines at 3 and 6, to divide the grid in 3×3 of the 3×3 squares. This took some reading to find out how to easily do this, but turns out CSS nth-child() psuedoclass handles this perfectly. I covered this in this post here.

Take a look at the app

I might move this to a more permanent home later, but if you want to check out the app, you can take a look here.

CSS styling alternating items, rows or columns using nth-child

The nth-child CSS psuedoclass allows you to apply a style to child elements in lists, such as <li> items or table <tr> rows or <td> cells.

I’ve been working on a React frontend client to a Sudoku puzzle solver which is deployed as am AWS Lambda. It’s easy to build a table of <input> fields, but how can you styles that alternate only every 3 elements?

nth-child is perfect for this. First, you can specify the repeating style every nth element, starting with a thick border every 3 <td> cells : nth-child(3n):

/* right border for every 3rd td */
.sudoku-grid tbody tr td:nth-child(3n){
border-right: 3px solid;
}

This results in this:


To handle the first column, you can apply a style to a specific element only, with:

/* set 1st column left border */
.sudoku-grid tbody tr td:nth-child(1){
border-left: 3px solid;
}

This gives:

Now similarly for the <tr> rows, every 3rd row:

 /* bottom border for every 3rd tr row */
.sudoku-grid tbody tr:nth-child(3n){
border-bottom: 3px solid;
}

And 1st row, now we’re done!

/* top border for first row */
.sudoku-grid tbody tr:nth-child(1){
border-top: 3px solid;
}

AWS S3 error: “The bucket you are attempting to access must be addressed using the specified endpoint”

Most errors on AWS are explicit and self explanatory, but once in a while you run into something that tells you something went wrong, but not how to fix it.

When setting up an S3 bucket to serve a static website, I got this error being returned as 301s for referenced files:

<Error>
<Code>PermanentRedirect</Code>
<Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message>
<Endpoint>s3.amazonaws.com</Endpoint>
<Bucket>static</Bucket>
...
</Error>

If you click an object in a bucket and click the overview tab, you can get a URL direct to the object, for example:

To load a static site from S3 though, you need to use a URL referencing the bucket in the server name, so instead of this:

https://s3-us-west-1.amazonaws.com/react-sudoku-solver/index.html

You should use this:

http://react-sudoku-solver.s3-us-west-1.amazonaws.com/index.html

(I’m currently working on a React app as a Sudoku Solver – more on this later)