Backend DB architecture
With a friend, we were preparing for a Facebook general-architecture interview question. I proposed we try to solve the following problem (very dear to my heart):
Backend architecture for massive post analytics
So basically - selecting a DB that can efficiently query and scale well with the number of users.
Data representation of a post
Each post has the following meta-data:
- Customer ID
- Channel ID (i.e. the Facebook page ID, or Twitter account ID, etc)
- Created date - date & time when posted
- Post type - one of "TEXT", "IMAGE", "VIDEO", "LINK"
- Text
- Labels - a query can match posts that:
  - Have no labels
  - Have ALL of the specified labels
  - Have ANY of the specified labels
The analytics data is just:
- 100 integers
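To make the data model concrete, here is a minimal in-memory sketch of a post record; the field names and types are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical representation of one post; "metrics" holds the
# 100 analytics integers mentioned above.
@dataclass
class Post:
    customer_id: int
    channel_id: int
    created_date: datetime
    post_type: str          # one of "TEXT", "IMAGE", "VIDEO", "LINK"
    text: str
    labels: frozenset[str]  # up to 10 labels; may be empty
    metrics: list[int]      # the 100 analytics integers

p = Post(1, 42, datetime(2019, 5, 1), "TEXT", "hello",
         frozenset({"promo"}), [0] * 100)
```

A set for the labels makes the "no labels" / "ALL" / "ANY" matching modes natural to express later.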
Sample queries
A list of possible queries:
- All posts for a "Channel ID" for the last 30 days
- All posts for all "Channel IDs" in a "Customer ID" for a given "Post type" for the last 60 days
- All posts in a "Customer ID" that have specified "Labels" for the last year
- All posts in a "Channel ID" of "Type"="TEXT" that have "no labels" in the last 6 months
POC parameters
We agreed that we should generate sample data for a Proof of Concept. Each of us would choose a solution and implement it, and we would benchmark the implementations against each other.
The sample data set we would generate:
- 5,000 Clients
- 50,000 Channels - randomly distributed among the clients
- 16,040,001 posts (~15 Gigabytes in total). Each post has:
  - a maximum of 10 "Labels"
  - a "Created date" after 2018
Table of Contents for the series
- Requirements
- Data types
- Generate random CSV/JSON data (TBD..)
- Query Interface (TBD..)
- Linear scanner (TBD..)