How to design a Big Data Platform


Data Producer: Rider / Driver App, API / Services, Dispatch (GPS logs), Map Services

Data Product (e.g. Uber surge pricing: prices rise when demand is high and fall when it is low)

Real-time pipeline (Kafka -> Surge / ELK / Storm / Samza) -> Mobile App / Debugging / Real-time, fast analytics / Alerts, Dashboards

Batch pipeline (Hadoop, Vertica) -> Application Data Science / Ad hoc Exploration / Analytics Reporting


When Simple Things Work

User <--> app <--> DB


Heating Up

User <--> proxies <--> apps <--> DBs
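
A minimal sketch of what the proxy tier in the diagram above does, assuming plain round-robin distribution. The `RoundRobinProxy` class and the server names are hypothetical and only illustrate the idea; a real proxy would also forward the request and track server health.

```python
from itertools import cycle


class RoundRobinProxy:
    """Toy proxy that spreads requests evenly over app servers (hypothetical)."""

    def __init__(self, app_servers):
        if not app_servers:
            raise ValueError("need at least one app server")
        self._servers = cycle(app_servers)

    def route(self, request):
        # Pick the next server in rotation; forwarding is left out of the sketch.
        server = next(self._servers)
        return server, request


proxy = RoundRobinProxy(["app-1", "app-2", "app-3"])
routed = [proxy.route(f"req-{i}")[0] for i in range(6)]
```

Each app server receives every third request, so adding servers to the list is one way to absorb more traffic.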


Unbalanced Requirements

Inside an app, different features stress different resources (CPU / memory / IO), for example:



User Authentication

News Feed / Timeline


Failure is Normal (Network / Hardware)


Computing power (CPU), network bandwidth, and disk space are all limited on a single server

Failures will always happen (network / hardware), so the system must be designed to tolerate them


When should you consider scaling?

Hardware utilization is over 90% and the number of requests keeps going up

A bottleneck appears (e.g. response time > 1 s at the current QPS, memory, network bandwidth, IO)
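
The first rule of thumb can be written as a small check. This is only a sketch: the function name `should_scale`, the sample format, and the definition of "going up" (every sample larger than the previous one) are assumptions, and the 0.90 threshold simply mirrors the 90% figure in these notes.

```python
def should_scale(utilization_samples, request_counts, threshold=0.90):
    """Return True when utilization stays above the threshold and requests keep rising."""
    if not utilization_samples or len(request_counts) < 2:
        return False
    # Every recent sample must be above the threshold, not just one spike.
    over_threshold = all(u > threshold for u in utilization_samples)
    # "Going up" is taken to mean strictly increasing request counts.
    rising = all(b > a for a, b in zip(request_counts, request_counts[1:]))
    return over_threshold and rising


busy_and_growing = should_scale([0.93, 0.95], [100, 120, 150])
one_quiet_sample = should_scale([0.50, 0.95], [100, 120])
traffic_dropping = should_scale([0.95], [100, 90])
```

Requiring both conditions avoids scaling out for a short spike or for a busy but shrinking workload.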


Data Pipeline (core piece of infrastructure that carries all data in the company)

Loading / Receiving incoming data

Data Storage

Data computation


A distributed system is a must-have

High scalability


SMACK = Spark + Mesos (Resource Management) + Akka (Reactive Programming framework, event/data driven) + Cassandra (data storage) + Kafka (receiving message / data transfer)


Cloud services are more and more powerful. Does that mean I can care less about infrastructure management?

No. If you are not familiar with the infrastructure, you might misuse the service APIs (e.g. debug mode should always be turned off in production; otherwise it might block requests on the app or DB server)


Common Data Pipeline

Data ingestion layer

Data storage layer

Data computation layer

Cluster scheduling layer


Data ingestion layer

High Throughput

Merely pass through

Simple processing logic

Cannot serve as a storage layer


Kafka (fast: MB/s throughput to thousands of clients / scalable / durable: messages are persisted to disk): a distributed messaging system (like a postal service such as USPS / UPS / Shunfeng)
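
To make the messaging idea concrete, here is a toy in-memory append-only log. It is not the Kafka API: `ToyLog`, its methods, and the event names are hypothetical. It only illustrates the model of producers appending messages and consumers reading from their own offsets.

```python
class ToyLog:
    """In-memory stand-in for one partition of a message log (hypothetical)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # Producers only append; the returned offset is the message's position.
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        # Consumers pull from their own offset; reading does not delete anything.
        return self._messages[offset:offset + max_messages]


log = ToyLog()
for event in ["trip_requested", "driver_assigned", "trip_started"]:
    log.append(event)

# Two independent consumers, each tracking its own position in the log.
dashboard_batch = log.read(0)
alerts_batch = log.read(2)
```

Because reads never remove messages, a dashboard and an alerting job can consume the same stream at different speeds.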


Data Storage Layer

Cassandra (high availability / no single point of failure): a distributed storage system that splits data across servers. Unlike master-based stores such as HDFS, it has no central metadata server; nodes are peers and share cluster state via gossip


DB – store and search (index) data


Keys are hashed onto a ring, and each server owns a portion of that ring (consistent hashing)
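
A minimal sketch of consistent hashing, assuming MD5 as the hash function and 100 virtual nodes per server. The `HashRing` class, the node names, and the key format are hypothetical; this is not how Cassandra itself is implemented.

```python
import bisect
import hashlib


def _hash(key):
    # Stable hash; md5 is used only for even distribution, not for security.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical)."""

    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((_hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    def server_for(self, key):
        # Walk clockwise to the first position at or after the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.server_for("rider:42")
```

The useful property is that adding a server only moves the keys that now fall on the new server's portion of the ring; all other keys stay where they were.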


Data computation layer

Highly available / Fast computation / Able to handle traffic spikes / Retries on failure
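
Retrying on failure can be sketched as follows, assuming exponential backoff between attempts. The `retry` helper and `flaky_task` are hypothetical; real frameworks such as Spark handle task retries internally.

```python
import time


def retry(task, attempts=3, base_delay=0.01):
    """Run task(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))


calls = {"count": 0}


def flaky_task():
    # Fails twice, then succeeds, to mimic a transient network error.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "done"


result = retry(flaky_task)
```

Doubling the delay after each failure gives a struggling dependency time to recover instead of hammering it with immediate retries.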



Distributed calculation / Split any calculation into steps (Map / Reduce)

Think of 3 * 3 = 3 + 3 + 3
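
The multiplication example can be written as a map step and a reduce step. This is a single-process sketch using Python built-ins; `map_step` and `reduce_step` are hypothetical names, and in a real cluster each chunk would be handled by a different worker.

```python
from functools import reduce


def map_step(chunk):
    # Each worker independently sums its own chunk of numbers.
    return sum(chunk)


def reduce_step(left, right):
    # Partial results are combined pairwise into the final answer.
    return left + right


# 3 * 3 expressed as 3 + 3 + 3: three chunks, one per (imaginary) worker.
chunks = [[3], [3], [3]]
partials = list(map(map_step, chunks))
total = reduce(reduce_step, partials)

# The same two steps work for a larger job split into arbitrary chunks.
numbers = list(range(1, 101))
big_chunks = [numbers[i:i + 25] for i in range(0, len(numbers), 25)]
big_total = reduce(reduce_step, map(map_step, big_chunks))
```

Because the map step needs no shared state, chunks can be processed in parallel and a failed chunk can be recomputed on its own.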

Spark (open source cluster computing framework)

A response to the limitations of Hadoop

Computation optimization

In-memory computing (fast, but memory is expensive, so it is a poor fit for data that far exceeds cluster memory, e.g. TB or PB scale)


Cluster scheduling layer


Deployment Consideration

Run on a cloud service for scalability (AWS / Google Cloud Platform)

Always set up monitoring so you know what to improve


Latency Numbers Every Programmer Should Know


Region and Availability Zones

A region is a geographic area where AWS data centers are located (e.g. us-east-1 in N. Virginia, eu-west-1 in Ireland)

Every region has multiple availability zones for high availability, so that a local disaster (earthquake, tornado, etc.) does not take down the whole region


MongoDB (simple, small data)

Cassandra (higher QPS requirements)


About liyao13

Yao Li is a web and iOS developer and blogger with a passion for technology and business. In his blog, he shares code snippets, tutorials, resources, and notes to help people develop their skills. Follow him on Twitter at @Yaoli0615 for the latest tech updates.