Posts

Showing posts from 2016

Automating page scraping with Selenium and Chrome

Sometimes you are interested in information on some page, be it on social media, a news site, or a blog, but you have no automated way to access the data through an API. For example, you want to follow the public FB posts of the girl who did not accept your friend request. Or you want to monitor a site for product availability or price range. Or you might want to get notified when the university web site pushes a new update. Be warned that such automated scanning might be against the respective site's policy, and you might risk having your account suspended or facing legal proceedings. Lately I needed to monitor a web site for auto parts. I was interested in the availability of a certain part for my car and wanted to get notified when something new showed up. The vendor does not have an API or RSS feeds, but claims that the site is updated the moment a new part becomes available and the moment an old part is sold. Fortunately the filter for figuring out the car, model, yea...

Upload file to Google Storage from memory with Node.js gcloud library

I was recently working with Google Cloud Storage (GCS) through the gcloud Node.js library (version 0.31.0) and was surprised that there is no API method to upload a file directly from memory. So if you are looking for it, the shorthand is just to use the write stream returned from createWriteStream() on the file object:

    var gcloud = require("gcloud");
    var storage = gcloud.storage({projectId: "dir-bg-scraper"});
    var sample_TXT_file = "Hello there,\r\n" +
        "This is sample text file with important info\r\n\r\n" +
        "Best Regards!";
    storage.createBucket("dir-bg-scraper", function (err, bucket, apiResponse) {
        if (err === null) {
            // bucket is the newly created "Test-Bucket"
            console.log("'dir-bg-scraper.appspot.com' successfully created!");
            // create 'README.md' file into this bucket
            var new_file = bucket.file(...
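The stream-based idea described above can be tried out without a real bucket. The following is only a minimal sketch: `uploadFromMemory` and `createMemorySink` are hypothetical helper names, not part of the gcloud library, and the in-memory sink merely stands in for the write stream that createWriteStream() returns on a file object.

```javascript
const { Writable } = require("stream");

// Hypothetical helper (not part of the gcloud library): writes an in-memory
// string or Buffer to any writable stream and resolves once the stream has
// finished flushing it.
function uploadFromMemory(writeStream, contents) {
  return new Promise((resolve, reject) => {
    writeStream.on("error", reject);
    writeStream.on("finish", resolve);
    // end() writes the final chunk and closes the stream in one call.
    writeStream.end(contents);
  });
}

// Stand-in for the stream returned by createWriteStream() on a file object,
// so the sketch runs without credentials: it simply collects chunks in memory.
function createMemorySink() {
  const chunks = [];
  const sink = new Writable({
    write(chunk, encoding, callback) {
      chunks.push(chunk);
      callback();
    },
  });
  // Convenience accessor for inspecting what was written.
  sink.contents = () => Buffer.concat(chunks).toString("utf8");
  return sink;
}
```

With the real library, the sink would be replaced by the write stream obtained from the file object, and the returned promise would settle when the upload finishes or fails.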

Google Datastore Namespaces

The Google Datastore is one of Google's publicly available scalable NoSQL solutions. The other is BigTable - which can be installed on a swarm of Compute Engine virtual machines. The Datastore is built on top of BigTable. So with BigTable you get the raw performance, but you miss some goodies available with the Datastore, like transactions and an SQL-like query language. The Datastore terminology gets me confused. It introduces "Dataset" and "Namespace", which are not immediately clear and caused me some trouble until I figured them out. That is why I started this post. On the welcome page of the Datastore project the terminology is defined as:

Concept | Datastore | Relational database
Category of object | Kind | Table
One object | Entity | Row
Individual data for an object | Property | Field
Unique ID for an object | Key | Primary key

But the documentation does not define what a Dataset is. Through trial and error I found out that the Dataset is the ...

Google Datastore intro

I have been catching up on the Google Datastore for several months now. It seems the major promotion for it started back in 2008, or at least that is when most of the videos from Google I/O started to appear. It got some hype until 2012, and then no more promotion from Google - they now focus on mobile, web and material design. Anyway, I've watched the following videos multiple times, and recommend them if you want to know more about this scalable, sorted, distributed, persisted NoSQL solution:

"Datastore under the covers", 2008 talk. The slides are a mess, but Ryan Barrett gives a good introduction to how the entities and indexes are laid out, and how transactions work.

"Scalable, Complex Apps on App Engine", 2009 talk. Brett Slatkin presents solutions for complex data structures, keeping in mind the limitations of the Datastore.

Other notable videos to check out are: Google I/O 2010 - Next gen queries; Google I/O 2011: More 9s Please: Under Th...

Scraping the timeline of a Facebook page

Have you ever wondered how to get all the timeline posts for a given Facebook page? And by all, I mean all of them, back to the very first post that was created on the page. Including all the comments, and the replies to the comments. Including all the likes on the posts, comments and replies, and not just the number of likes, but even the users who made them. If you are an admin of the page, you can access all the posts, even the ones that are hidden, gated or restricted. If you are not an admin, you can still access all the public ones. To work with the Facebook Graph API you need an application. With the application you can get a user access token, or, just for trying things out, you can get the App ID and App Secret from the app dashboard in FB developers at: https://developers.facebook.com/apps/ /dashboard/ The Facebook Graph API is a mesh of objects with links between them. To access the timeline posts for a page we start from the page itself. Let's take for an example th...
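Reaching the very first post means walking through the results one page at a time. The following is a minimal sketch, assuming the usual Graph API response shape of a `data` array plus an optional `paging.next` URL; `collectAllPages` and `fetchJson` are hypothetical names, and `fetchJson` is assumed to take a URL and resolve to the parsed JSON body.

```javascript
// Hypothetical helper: keeps requesting pages until the response no longer
// carries a `paging.next` URL, gathering every item from each `data` array.
async function collectAllPages(firstUrl, fetchJson) {
  const items = [];
  let url = firstUrl;
  while (url) {
    const page = await fetchJson(url);
    // A page without a `data` array contributes nothing.
    items.push(...(page.data || []));
    // Stop when there is no link to a further page.
    url = page.paging && page.paging.next ? page.paging.next : null;
  }
  return items;
}
```

Passing the fetch function in as a parameter keeps the access token handling and HTTP client outside the loop, and makes it possible to try the paging logic against canned responses first.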