Automating page scraping with Selenium and Chrome

Sometimes you are interested in information on some page, be it on social media, a news site, or a blog, but you have no automated way to access this data via an API. For example, you want to stalk the public FB posts of the girl who did not accept your friend request. Or you want to monitor a site for product availability or price range. Or you might want to get notified when the university web site publishes a new update. Be warned that such automated scanning might be against the respective site's policy, and you might risk having your account suspended or facing legal proceedings.

Lately I needed to monitor a web site for auto parts. I was interested in the availability of a certain part for my car and wanted to get notified when something new shows up. The vendor does not have an API or RSS feeds, but claims that the site is updated the moment a new part becomes available and the moment an old part is sold.

Fortunately the filter for the car make, model, year, and the part is all encoded in the URL. So we just need
  1. to point the browser every day at the URL
  2. scan the available offerings
  3. check if there is a new offering
  4. send an email
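
Step 3 can be sketched in a few lines. This is only an illustration under assumptions: it supposes each scraped offering is an object with a unique `link` property and that the links seen in earlier runs are kept somewhere (a file or a database). The function name `findNewOfferings` and the sample data are made up for this example.

```javascript
// Return the offerings whose link was not seen in a previous run.
// Assumption: every offering has a unique `link` property.
function findNewOfferings(previousLinks, currentOfferings) {
  var seen = new Set(previousLinks);
  return currentOfferings.filter(function (offering) {
    return !seen.has(offering.link);
  });
}

// Made-up data, for illustration only.
var seenYesterday = ["/product/1001", "/product/1002"];
var scrapedToday = [
  { link: "/product/1002", title: "Left mirror" },
  { link: "/product/1003", title: "Right mirror" }
];

var fresh = findNewOfferings(seenYesterday, scrapedToday);
console.log(fresh.map(function (o) { return o.title; }));
// prints [ 'Right mirror' ]
```

An email would then be sent only when the returned array is not empty.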

If you want to follow the particular example - the section for the rear view mirrors at the vendor looks like:

Project setup

Now let the fun begin, i.e. automating this task. To follow along you need to have Chrome installed on your Linux, Mac or Windows box. It is easy to switch to Firefox, but you have to figure this out yourself.

If it matters, I currently run the latest available version of node:
> node --version
v6.8.1

Set up a directory for our node.js code:
> mkdir scrape_for_auto_parts
> cd scrape_for_auto_parts
> git init
> npm init -f

We need to add a .gitignore file to skip the node_modules and logs directories:
> cat .gitignore
node_modules
logs

Next we npm install the libraries that we will need:
> npm install --save selenium-webdriver mongodb bluebird

Selenium setup

The way to control most of the popular browsers is via Selenium. To control the browser from node.js we could use webdriver.io or Selenium WebDriver. For this article we will use the latter with its bindings for node.js.

There are a number of ways to work with Selenium:
  1. Use a standalone Selenium server as a proxy between your script's requests and the browser-specific drivers
  2. Download the browser drivers and just put them in the root directory
Let's use the last option, because it is the simplest, it is self-contained, and it is easier to set up as a daily cron job. Go to the ChromeDriver page and download the executable for your operating system. For writing this article I use Windows 10, but later I will move everything to a Debian box. So I download and add at the root of the project:
  •  chromedriver.exe - for managing Chrome on the Windows box
  •  chromedriver for 64-bit Linux - for managing Chrome on the Debian box later.
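
Before going further, a quick sanity check can confirm that the right driver file sits in the project root. The snippet below is a minimal sketch: the helper name `chromedriverFileName` is made up, and it assumes the driver files are stored under exactly the names listed above and that the script is run from the project root.

```javascript
var fs = require("fs");
var path = require("path");

// File name of the ChromeDriver executable for a given platform string
// (the kind of value reported by process.platform).
function chromedriverFileName(platform) {
  return platform === "win32" ? "chromedriver.exe" : "chromedriver";
}

// Assumption: the script is started from the project root.
var expected = path.join(process.cwd(), chromedriverFileName(process.platform));
console.log(expected, fs.existsSync(expected) ? "found" : "missing");
```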

Starting a browser to get page title

Now that the setup is complete, let's write a simple index.js to test whether everything works so far:
var webdriver = require('selenium-webdriver');

var driver = new webdriver.Builder()
    .forBrowser('chrome')
    .build();

var URL_to_monitor = "https://www.megaparts.bg" +
                        "/products/avtochasti/chasti-po-kupeto-malogabaritni/ogledala" +
                        "?pa_marka-avtomobil=21" +
                        "&pa_model-avtomobil=212" +
                        "&productFilterSubmitted=1";

driver.get(URL_to_monitor);
driver.getTitle().then(function(title){
  console.log(title);
})

driver.sleep(10000);
driver.quit();

If you run this:
> node index.js

and you see a Chrome browser window pop up and then the title of the page printed on the console - you are all good!

Note that the "driver.sleep(10000);" is there just so that you can see the page on the screen. This timer starts only after "driver.get" has completed, which by default means the page's load event has fired, i.e. the page is fully loaded. So in our real scraping of this server-rendered page there is no need to wait to make sure that the page has loaded.
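
As a side note, the monitored URL is just a fixed path plus a few filter parameters, so it can be assembled from values instead of being hard-coded. The sketch below uses Node's built-in querystring module. The function name `buildListingUrl` is made up, and reading "pa_marka-avtomobil" as the car make and "pa_model-avtomobil" as the model is an assumption based only on the parameter names; IDs for other cars would have to be taken from the site's own filter.

```javascript
var querystring = require("querystring");

// Build the listing URL for the rear view mirrors section.
// Assumption: makeId and modelId are the numeric IDs the site uses in its filter.
function buildListingUrl(makeId, modelId) {
  var base = "https://www.megaparts.bg" +
             "/products/avtochasti/chasti-po-kupeto-malogabaritni/ogledala";
  var query = querystring.stringify({
    "pa_marka-avtomobil": makeId,
    "pa_model-avtomobil": modelId,
    "productFilterSubmitted": 1
  });
  return base + "?" + query;
}

// Same values as in index.js above.
console.log(buildListingUrl(21, 212));
```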

selenium-webdriver API overview

All the objects exposed by selenium-webdriver are described in its API documentation. One thing that needed a bit of getting used to is that all the calls made to the driver object
  1. Return a promise
  2. Are queued up internally
This means that the following code
driver.get(URL_to_monitor).then(function(){
  console.log("driver.get..");
});
driver.getTitle().then(function(title){
  console.log("driver.getTitle..");
})

does not run in parallel, as we might expect from the asynchronous nature of JavaScript, but rather:

  1. the "driver.get" will be added to the internal control flow queue
  2. the "driver.getTitle" will be added to the internal control flow queue
  3. On the next tick, the "driver.get" will get executed, since it is the first one on the queue. It will need some seconds to download the html, css, images and render them.
  4. Then the promise of the "driver.get" will be fulfilled, which will print "driver.get.."
  5. On the next tick, the "driver.getTitle" will get executed. This call is fast, since the page is already loaded, so only a lookup in the DOM is needed.
  6. Then the promise of the "driver.getTitle" will be fulfilled, which will print "driver.getTitle.."
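
The ordering described in these steps can be imitated with a plain promise chain. The following is a simplified model and not Selenium's actual implementation; `createCommandQueue` and `fakeCommand` are hypothetical names used only for illustration, and the timeouts merely stand in for a slow page load and a fast DOM lookup.

```javascript
// Simplified model of a command queue; NOT Selenium's real control flow.
function createCommandQueue() {
  // Every scheduled command is chained onto this promise, so commands
  // start strictly one after another, in the order they were scheduled.
  var tail = Promise.resolve();

  return {
    // `command` is a function returning a value or a promise.
    schedule: function (command) {
      var result = tail.then(command);
      // Keep the chain going even if this command fails.
      tail = result.then(function () {}, function () {});
      return result;
    }
  };
}

// Hypothetical stand-in for a browser command that takes delayMs to finish.
function fakeCommand(name, delayMs, log) {
  return function () {
    return new Promise(function (resolve) {
      setTimeout(function () {
        log.push(name);
        resolve(name);
      }, delayMs);
    });
  };
}

var log = [];
var queue = createCommandQueue();
queue.schedule(fakeCommand("get", 50, log));      // slow, scheduled first
queue.schedule(fakeCommand("getTitle", 5, log))   // fast, scheduled second
  .then(function () {
    console.log(log.join(" -> "));
    // prints: get -> getTitle
  });
```

Even though the second command is much faster, it finishes last, because it does not start until the first one has completed.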


In this way the Selenium team makes automating the browser behave like a synchronous language, i.e. the commands to the driver that controls the browser are executed in sequence!

Of all the methods available on the WebDriver class, we will need "findElement(locator)" and "findElements(locator)", which schedule a command to find a single element or multiple elements on the page. It is possible to search By: id, class name, CSS selector, and others. Here is an example of how to get the count of the parts listed on the target page:
driver.findElements(By.className("product-grid-item"))
  .then(function(parts){
    console.log(parts.length);
  });

Putting it all together

The final code that grabs the auto parts available on the web page looks like:
var webdriver = require("selenium-webdriver");
var Promise = require("bluebird");

var By = webdriver.By;

var driver = new webdriver.Builder()
    .forBrowser('chrome')
    .build();

var URL_to_monitor = "https://www.megaparts.bg" +
                        "/products/avtochasti/chasti-po-kupeto-malogabaritni/ogledala" +
                        "?pa_marka-avtomobil=21" +
                        "&pa_model-avtomobil=212" +
                        "&productFilterSubmitted=1";

driver.get(URL_to_monitor).then(function(){
  console.log("Page is ready");
});
driver.getTitle().then(function(title){
  console.log(title);
});

driver.findElements(By.className("product-grid-item"))
  .then(function(parts){
    var parsed_parts = [];
    parts.forEach(function(part){
        parsed_parts.push( extract_info(part) );
    });

    return Promise.all(parsed_parts);
  })
  .then(function(parsed_parts){
    console.log(parsed_parts);
  });

driver.quit();

function extract_info(part) {
    return Promise.all([
        // the product image
      part.findElement(By.css(".product-image img"))
              .getAttribute("src"),
        // the product ID
      part.findElement(By.css(".product-image"))
              .getAttribute("href"),
        // the description of the product
      part.findElement(By.css(".name"))
              .getAttribute("title"),
        // the price
      part.findElement(By.css(".price"))
              .getText(),
    ]);
}
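
The arrays returned by extract_info are easier to work with once the fields have names. Here is a small sketch of such a conversion. The helpers `toPartRecord` and `parsePrice` are made up for this example, and the price format is an assumption: the text is supposed to hold a single number such as "45.50" or "45,50", optionally followed by a currency label.

```javascript
// Assumption: the price text holds one number such as "45.50" or "45,50",
// optionally followed by a currency label. Returns null if no number is found.
function parsePrice(text) {
  var match = String(text).replace(",", ".").match(/\d+(\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}

// Turn the [image, link, title, priceText] array produced by extract_info
// into an object with named fields.
function toPartRecord(fields) {
  return {
    image: fields[0],
    link: fields[1],
    title: fields[2],
    price: parsePrice(fields[3])
  };
}

// Made-up values, for illustration only.
var record = toPartRecord([
  "https://example.com/img/mirror.jpg",
  "https://example.com/product/1003",
  "Right mirror",
  "45,50 лв."
]);
console.log(record.title, record.price);
// prints: Right mirror 45.5
```

In the final script this could be applied as parsed_parts.map(toPartRecord) inside the last "then" callback.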


Yes, in this particular case it would have been possible just to request the page (e.g. with request.get) and parse the returned HTML, since this site generates its pages on the server. But if the web page uses a JavaScript framework to render the page in the browser, then Selenium is the answer.

Download the complete example

Clone the example:

git clone https://github.com/karpachev/scrape_for_auto_parts --branch scraping_chrome_step_1

and then download all npm dependencies:
npm install

Finally - run the scraping to see the list of auto parts on the page printed to the console:
node index.js
