Backend DB architecture
With a friend, we were preparing for a Facebook general-architecture interview question. I proposed we try to solve the following problem (very dear to my heart):
Backend architecture for massive post analytics
So basically - selecting a DB that can efficiently query and scale well with the number of users.
Data representation of a post
Each post has the following meta-data:
- Customer ID
- Channel ID (i.e. the Facebook page ID, or Twitter account ID, etc)
- Created date - date & time when posted
- Post type - one of "TEXT", "IMAGE", "VIDEO", "LINK"
- Text
- Labels - a query can match posts that:
  - Have no labels
  - Have ALL of the specified labels
  - Have ANY of the specified labels
The analytics data is just:
- 100 integers
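To make the data model concrete, here is a minimal in-memory sketch of a post record; the field names and types are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical representation of one post; "metrics" holds the
# 100 analytics integers mentioned above.
@dataclass
class Post:
    customer_id: int
    channel_id: int
    created_date: datetime
    post_type: str          # one of "TEXT", "IMAGE", "VIDEO", "LINK"
    text: str
    labels: frozenset[str]  # up to 10 labels; may be empty
    metrics: list[int]      # the 100 analytics integers

p = Post(1, 42, datetime(2019, 5, 1), "TEXT", "hello",
         frozenset({"promo"}), [0] * 100)
```

A set for the labels makes the "no labels" / "ALL" / "ANY" matching modes natural to express later.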
Sample queries
A list of possible queries:
- All posts for a "Channel ID" for the last 30 days
- All posts for all "Channel IDs" in a "Customer ID" for a given "Post type" for the last 60 days
- All posts in a "Customer ID" that have specified "Labels" for the last year
- All posts in a "Channel ID" of "Type"="TEXT" that have "no labels" in the last 6 months
POC parameters
We agreed that we should generate sample data for a Proof of Concept. Each of us would choose a solution and implement it, and we would benchmark the implementations against each other.
The sample data set we would generate:
- 5,000 Clients
- 50,000 Channels - randomly distributed among the clients
- 16,040,001 posts (~15 Gigabytes in total). Each post has:
  - a maximum of 10 "Labels"
  - a "Created date" after 2018
Table of Contents for the series
- Requirements
- Data types
- Generate random CSV/JSON data (TBD..)
- Query Interface (TBD..)
- Linear scanner (TBD..)