Backend DB architecture

A friend and I were preparing for a general architecture question of the kind asked in Facebook interviews. I proposed we try to solve the following problem (very dear to my heart):
Backend architecture for massive post analytics 
So basically - selecting a DB that can be queried efficiently and scales well with the number of users.


Data representation of a post

Each post has the following metadata (a schema sketch follows these lists):

  • Customer ID
  • Channel ID (i.e. the Facebook page ID, or Twitter account ID, etc)
  • Created date - date & time when posted
  • Post type - one of "TEXT", "IMAGE", "VIDEO", "LINK"
  • Text
  • Labels - a list of tags; queries need to filter posts that:
    • Have no labels
    • Have ALL of the specified labels
    • Have ANY of the specified labels
The analytics data per post is just:
  • 100 integers
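
To make the data model concrete, here is a minimal sketch of one possible representation of a post as a Python dataclass. The field names and types are my assumptions based on the lists above, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List


class PostType(Enum):
    TEXT = "TEXT"
    IMAGE = "IMAGE"
    VIDEO = "VIDEO"
    LINK = "LINK"


@dataclass
class Post:
    customer_id: int          # the customer that owns the channel
    channel_id: int           # e.g. a Facebook page ID or a Twitter account ID
    created_date: datetime    # date & time when posted
    post_type: PostType       # one of TEXT, IMAGE, VIDEO, LINK
    text: str
    labels: List[str] = field(default_factory=list)   # 0..10 labels per post
    metrics: List[int] = field(default_factory=list)  # the 100 analytics integers
```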

Sample queries

A list of possible queries (a SQL sketch follows the list):
  • All posts for a "Channel ID" for the last 30 days
  • All posts for all "Channel IDs" in a "Customer ID" for a given "Post type" for the last 60 days
  • All posts in a "Customer ID" that have specified "Labels" for the last year
  • All posts in a "Channel ID" of "Post type" = "TEXT" that have no labels, in the last 6 months
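
For a rough sense of what these look like in practice, here is a hedged sketch of the four queries as SQL strings, assuming a hypothetical relational layout: a posts table with columns matching the schema above and a post_labels (post_id, label) join table. Table and column names are illustrative only, not a committed design.

```python
# All posts for a "Channel ID" for the last 30 days
q1 = """
SELECT * FROM posts
WHERE channel_id = ?
  AND created_date >= NOW() - INTERVAL '30 days';
"""

# All posts for all channels of a "Customer ID", for a given "Post type", last 60 days
q2 = """
SELECT * FROM posts
WHERE customer_id = ?
  AND post_type = ?
  AND created_date >= NOW() - INTERVAL '60 days';
"""

# All posts of a "Customer ID" that carry ANY of the specified labels, last year
q3 = """
SELECT DISTINCT p.* FROM posts p
JOIN post_labels l ON l.post_id = p.id
WHERE p.customer_id = ?
  AND l.label IN (?, ?, ?)
  AND p.created_date >= NOW() - INTERVAL '1 year';
"""

# All TEXT posts in a "Channel ID" that have no labels, last 6 months
q4 = """
SELECT p.* FROM posts p
LEFT JOIN post_labels l ON l.post_id = p.id
WHERE p.channel_id = ?
  AND p.post_type = 'TEXT'
  AND l.post_id IS NULL
  AND p.created_date >= NOW() - INTERVAL '6 months';
"""
```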

POC parameters

We agreed to generate sample data for a Proof of Concept: each of us would choose a solution, implement it, and we would benchmark the results against each other.

The sample data set to generate (a generator sketch follows the list):
  • 5,000 Clients
  • 50,000 Channels - randomly distributed among the clients
  • 16,040,001 posts, about 15 Gigabytes in total. Each post has:
    • a maximum of 10 "Labels"
    • "Created date" - after 2018

