NoSQL Learning Notes

nosql

What

NoSQL (originally referring to “non SQL” or “non relational”)[1] database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. (Not only SQL, support SQL query)

Why

Simplicity of design, simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as “more flexible” than relational database tables.

How

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages (instead of SQL, for instance the lack of ability to perform ad-hoc joins across tables), lack of standardized interfaces, and huge previous investments in existing relational databases. Most NoSQL stores lack true ACID transactions, although a few databases, such as MarkLogicAerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB have made them central to their designs. (See ACID and join support.)

Instead, most NoSQL databases offer a concept of “eventual consistency” in which database changes are propagated to all nodes “eventually” (typically within milliseconds) so queries for data might not return updated data immediately or might result in reading data that is not accurate, a problem known as stale reads. Additionally, some NoSQL systems may exhibit lost writes and other forms of data loss. Fortunately, some NoSQL systems provide concepts such as write-ahead logging to avoid data loss. For distributed transaction processing across multiple databases, data consistency is an even bigger challenge that is difficult for both NoSQL and relational databases. Even current relational databases “do not allow referential integrity constraints to span databases.” There are few systems that maintain both ACID transactions and X/Open XA standards for distributed transaction processing.

1401269083847

Database Decision Tree

if Running a cluster

if Need SQL query

NewSQL

else

if Relationships are very important

Graph based Database (Neo4J …)

else

if Key-based access only

Key-Value based Database (Redis, Riak, …)

else

if Need Ad-hoc query

Document based Database (MongoDB, CouchDB, …)

else

Column based Database (Cassandra, HBase, …)

else

if Need SQL query

Relational Database Management System (MySQL, Oracle)

else

if Relationships are very important

Graph based Database (similar like above)

 

Reference:

NoSQL Databases: An Overview

SQL/NoSQL How to choose ?

 

Donate $5 to me for a coffee with PayPal and read more professional and interesting technical blog articles about web and mobile development. Feel free to visit my web app, WhizWallet, to apply for credit, store or gift cards, DealsPlus to browse daily deals and store coupons to save money.
Follow me @Yaoli0615 at Twitter to get latest tech updates.
Advertisements
Posted in CS Research&Application, Uncategorized | Tagged , , | Leave a comment

Consistent Hashing Learning Notes

cache

Why?

In most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation. (less keys to update when add or remove nodes, easier to scale up/down)

What?

Consistent hashing is a special kind of hashing such that when a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots.

How?

Consistent hashing is based on mapping each object to a point on the edge of a circle (or equivalently, mapping each object to a real angle). The system maps each available machine (or other storage bucket) to many pseudo-randomly distributed points on the edge of the same circle.

To find where an object should be placed, the system finds the location of that object’s key on the edge of the circle; then walks around the circle until falling into the first bucket it encounters (or equivalently, the first available bucket with a higher angle). The result is that each bucket contains all the resources located between each one of its points and the previous points that belong to other buckets.

If a bucket becomes unavailable (for example because the computer it resides on is not reachable), then the points it maps to will be removed. Requests for resources that would have mapped to each of those points now map to the next highest points. Since each bucket is associated with many pseudo-randomly distributed points, the resources that were held by that bucket will now map to many different buckets. The items that mapped to the lost bucket must be redistributed among the remaining ones, but values mapping to other buckets will still do so and do not need to be moved.

A similar process occurs when a bucket is added. By adding new bucket points, we make any resources between those and the points corresponding to the next smaller angles map to the new bucket. These resources will no longer be associated with the previous buckets, and any value previously stored there will not be found by the selection method described above.

 

Reference:

Wikipedia: Consistent Hashing

Consistent hashing

The Simple Magic of Consistent Hashing

Improving load balancing with a new consistent-hashing algorithm

Consistent Hashing in Cassandra

Why do people use virtual nodes to get a load balance in consistent hashing implementation?

 

Donate $5 to me for a coffee with PayPal and read more professional and interesting technical blog articles about web and mobile development. Feel free to visit my web app, WhizWallet, to apply for credit, store or gift cards, DealsPlus to browse daily deals and store coupons to save money.
Follow me @Yaoli0615 at Twitter to get latest tech updates.
Posted in CS Research&Application, Uncategorized | Tagged , , | Leave a comment

Java Multi Thread Basics 2

c82-java-prog-logo

1. Singleton Design Pattern

check null after get lock to avoid instantiate multiple times (other thread passed first null check and are waiting for the lock)

2. Producer and Consumer problem is not like Read and Write, it doesn’t allow multiple consumers consume at same time like reader read, because reading doesn’t change resources, but consuming does (decrease resource number) .

3. Constant pool and thread pool are designed to improve performance (it takes time to create and destroy thread) and security (easy to manage and schedule threads).

4. newCachedThreadPool() is better for short and async tasks like DB access, newFixedThreadPool(int nThread) is better for long and stable tasks like file I/O.

5. There is no absolute right answer for Relational DB or NoSQL, Object Oriented Programming or Functional Programming. The trend is to converge, like Java adds lambda and async features, JavaScript adds collection and class.

6. Override (same name, same signature, different implementation) vs Overload (same name, different signature, different implementation)

7. Interview is not exam, transfer from inactive side (don’t know) to active one (what I have done, what I learned. Confident, happy, enjoy, bold).

8. Executor Thread Pool vs Fork Join Framework (Pool) — Divide & Conquer, Recursion, CPU intensive

Work stealing algorithm: definite processes handle indefinite tasks.

fork(): arrange to execute this task asynchronously in the pool.

join(): return the result of the computation when it’s done.

9. Optimization: Java Program -> Java Compiler -> JVM -> OS -> Hardware

Do one (your own) thing, do it well. Trust others.

10. Big Data problems:

Big data skills: Scala, Map/Reduce, Hadoop

Multi-Thread, Java

11. Small / Mid-size company – Family environment, PHD leader, talk and learn everything

Big / Giant company, streamline, nobody

12. System design: load balancer (consistent hashing), CAP

13. Intellij IDEA, jconsole, jstat, jmap

Donate $5 to me for a coffee with PayPal and read more professional and interesting technical blog articles about web and mobile development. Feel free to visit my web app, WhizWallet, to apply for credit, store or gift cards, DealsPlus to browse daily deals and store coupons to save money.
Follow me @Yaoli0615 at Twitter to get latest tech updates.
Resources:

Core Java Volume I–Fundamentals (10th Edition) (Core Series)

Core Java, Volume II–Advanced Features (10th Edition) (Core Series)

Test-Driven Java Development

Java Concurrency in Practice

Java: An Introduction to Problem Solving and Programming (7th Edition)

Java 9 for Programmers (Deitel Developer Series)

Java SE8 for the Really Impatient: A Short Course on the Basics (Java Series)

Core Java for the Impatient

Java: The Beginners Guide for every non-programmer which will attend you trough your learning process

Java Deep Learning Essentials

Machine Learning in Java

Learning Reactive Programming With Java 8

Java 9 Programming By Example

Thinking in Java (4th Edition)

The Java EE Architect’s Handbook, Second Edition: How to be a successful application architect for Java EE applications

Java Artificial Intelligence: Made Easy, w/ Java Programming

Posted in CS Research&Application, Uncategorized | Tagged , , | Leave a comment

Programming Retrospection (Aug 4)

wpid-Programming-Wallpaper-3

1. If the problem requirement is very complex, try to convert to simpler one. (In industry, you always get business requirement which is not efficient in technology, you should convert to a tech efficient algorithm/requirement)

标准复杂,就转换标准(蓄水池问题)

在工业界拿到的商业需求可能不是技术上的最优解,这种情况下就要优化需求(把商业需求转换成技术需求)。

2. Tow pointers always match to singularity (like window elements in an array).

双指针对应着单调性(如一些数组题里窗口内的状况)。

3. Sub-array problem is always related to the situation in the end.

子数组往往联系着结尾处的状况。

4. Array compression skills (like max sum of sub-matrix problem)

数组压缩(矩阵中最大子矩阵之和问题,根据行列大小进行优化,O(m^2*n) vs O(m*n^2) )。

5. It’s not true that you cannot use binary search if the array is not sorted, but if the matched value is in one half.

不一定有序才可以用二分,只要确定其中一半有符合条件的值。

6. One n^2 rectangle has n^4 sub-rectangle (n^2 start * n^2 end) and n^3 sub-square (n^2 start * n length).

一个长方形(n^2)中有(n^4)种子长方形,有(n^3)种子正方形。

Donate $5 to me for a coffee with PayPal and read more professional and interesting technical blog articles about web and mobile development. Feel free to visit my web app, WhizWallet, to apply for credit, store or gift cards, DealsPlus to browse daily deals and store coupons to save money.
Follow me @Yaoli0615 at Twitter to get latest tech updates.
Resources:

Core Java Volume I–Fundamentals (10th Edition) (Core Series)

Core Java, Volume II–Advanced Features (10th Edition) (Core Series)

Test-Driven Java Development

Java Concurrency in Practice

Java: An Introduction to Problem Solving and Programming (7th Edition)

Java 9 for Programmers (Deitel Developer Series)

Java SE8 for the Really Impatient: A Short Course on the Basics (Java Series)

Core Java for the Impatient

Java: The Beginners Guide for every non-programmer which will attend you trough your learning process

Java Deep Learning Essentials

Machine Learning in Java

Learning Reactive Programming With Java 8

Java 9 Programming By Example

Thinking in Java (4th Edition)

The Java EE Architect’s Handbook, Second Edition: How to be a successful application architect for Java EE applications

Java Artificial Intelligence: Made Easy, w/ Java Programming

 

 

Posted in CS Research&Application, Uncategorized | Tagged , , | Leave a comment

Java Multi Thread Basics

c82-java-prog-logo

1. Textbook questions (Summary – Divide – Summary)

Process (application, program) vs Thread (share memory)

Concurrency(progress at same time, some fast and others slow, single core) vs Parallelism (execute at same time, multi-core)

Runnable (interface) vs Threads (class)

 

2. Classic function interface: runnable, comparator

 

3. Class {

 Data field (property)

 Behavior (method)

}

 

4. Runnable is more flexible since it is a interface and can extend another class or implement other interfaces (Java has no multiple inheritance to avoid diamond problem)

 

5. Join: manage execution schedule (suspend parent thread)

 

6. notify() / notifyAll() -> ready

 

7. Synchronize – lock (class and instance level)

 

8. Reentry case (object lock for set and get methods, get / set can call set / get method)

 

9. System.out.println() and Debugger don’t help multi-thread problem debugging (it only prints relative order, print() is a thread, Debugger will hang all threads, need experience, small block)

 

10. Producer Consumer problem

BlockQueue

 

11. JVM

Heap: 1 (x 1), shared public resource, object

Stack: multiple (x N), thread – stack, method – stack frame, object reference

Method Area: constant pool

 

12. Garbage Collection (GC)

What, how, why?

Advantage vs Disadvantages

Memory Leak (example, algorithm)

 

 

Donate $5 to me for a coffee with PayPal and read more professional and interesting technical blog articles about web and mobile development. Feel free to visit my web app, WhizWallet, to apply for credit, store or gift cards, DealsPlus to browse daily deals and store coupons to save money.
Follow me @Yaoli0615 at Twitter to get latest tech updates.
Resources:

Core Java Volume I–Fundamentals (10th Edition) (Core Series)

Core Java, Volume II–Advanced Features (10th Edition) (Core Series)

Test-Driven Java Development

Java Concurrency in Practice

Java: An Introduction to Problem Solving and Programming (7th Edition)

Java 9 for Programmers (Deitel Developer Series)

Java SE8 for the Really Impatient: A Short Course on the Basics (Java Series)

Core Java for the Impatient

Java: The Beginners Guide for every non-programmer which will attend you trough your learning process

Java Deep Learning Essentials

Machine Learning in Java

Learning Reactive Programming With Java 8

Java 9 Programming By Example

Thinking in Java (4th Edition)

The Java EE Architect’s Handbook, Second Edition: How to be a successful application architect for Java EE applications

Java Artificial Intelligence: Made Easy, w/ Java Programming

Posted in CS Research&Application, Uncategorized | Tagged , | Leave a comment

How to design a Big Data Platform

what-is-big-data

Data Producer: Rider / Driver App, API / Services, Dispatch (GPS logs), Map Services

Data Product (Uber Surge Pricing – high in traffic, low in free time)

Real time pipeline (Kafka -> Surge / ELK / Storm / Samza) -> Mobile App / Debugging / Real time, fast analytics / Alerts, Dashboards

Batch pipeline (Hadoop, Vertica) -> Application Data Science / AdHoc Exploration / Analytics Reporting

 

Things Simple Works

User <—> app <—> DB

 

Heating Up

User <—> proxys <—> apps <—> DBs

 

Unbalanced Requirements

Inside app (CPU / Mem / IO)

 

The SOA

User Authentication

News Feed / Timeline

 

Failure is Normal (Network / Hardware)

 

Computer Power (CPU) / Network bandwidth / Hard-drive disk space is limited on a single server

Failure always exists (Network / Hardware)

 

When to consider to scale?

Hardware usage is over 90% and request number is going up

Bottleneck (e.g. QPS > 1s, memory, network bandwidth, IO)

 

Data Pipeline (core piece of infrastructure that carries all data in the company)

Loading / Receiving incoming data

Data Storage

Data computation

 

Distributed system is a must have

High scalability

 

SMACK = Spark + Mesos (Resource Management) + Akka (Reactive Programming framework, event/data driven) + Cassandra (data storage) + Kafka (receiving message / data transfer)

 

Cloud service is more and more powerful, will I care less about infrastructure management?

If you are not familiar with infrastructure, you might abuse using the API service (e.g. debug is always turned off on production environment, otherwise might block requests on app or DB server)

 

Common Data Pipeline

Data ingestion layer

Data storage layer

Data computation layer

Cluster scheduling layer

 

Data ingestion layer

High Throughput

Merely pass through

Simple process logic

Cannot serve as a storage layer

 

Kafka (Fast – MB/s to thousands clients / Scalable / Durable – disk) – distributed messaging system (like post service, USPS / UPS / Shunfeng)

 

Data Storage Layer

Cassandra (high availability / no single point of failure) – distributed storage system (split data into different servers, centralized store of metadata)

 

DB – store and search (index) data

 

Data is calculated into a ring, each server takes a portion (consistent hashing)

 

Data computation layer

High available / Fast computation / Able to handle spike traffic / Retry on failure

 

Hadoop

Distributed calculation / Split any calculation into steps (Map / Reduce)

Think of 3 * 3 = 3 + 3 + 3

Spark (open source cluster computing framework)

Respond of limitation of Hadoop

Computation optimization

In memory computing (fast, expensive, but cannot process too much data like TB or PB)

 

Cluster scheduling layer

 

Deployment Consideration

Runs on Cloud Service for scalability (AWS / Google Cloud Service)

Always setup monitoring for improvement

 

Latency Numbers Every Programmer Should Know

 

Region and Availability Zones

Region is a Geolocation where AWS data center locates (us-north virginia, eu-west Ireland)

Every region has multiple availability zones (for high availability reasons, earthquake, tornado, etc)

 

MongoDB (simple, small data)

Cassandra (higher requirement on QPS)

 

Donate $5 to me for a coffee with PayPal and read more professional and interesting technical blog articles about web and mobile development. Feel free to visit my web app, WhizWallet, to apply for credit, store or gift cards, DealsPlus to browse daily deals and store coupons to save money.
Follow me @Yaoli0615 at Twitter to get latest tech updates.
Resources:

Core Java Volume I–Fundamentals (10th Edition) (Core Series)

Core Java, Volume II–Advanced Features (10th Edition) (Core Series)

Test-Driven Java Development

Java Concurrency in Practice

Java: An Introduction to Problem Solving and Programming (7th Edition)

Java 9 for Programmers (Deitel Developer Series)

Java SE8 for the Really Impatient: A Short Course on the Basics (Java Series)

Core Java for the Impatient

Java: The Beginners Guide for every non-programmer which will attend you trough your learning process

Java Deep Learning Essentials

Machine Learning in Java

Learning Reactive Programming With Java 8

Java 9 Programming By Example

Thinking in Java (4th Edition)

The Java EE Architect’s Handbook, Second Edition: How to be a successful application architect for Java EE applications

Java Artificial Intelligence: Made Easy, w/ Java Programming

 

Posted in CS Research&Application, Uncategorized | Tagged , | Leave a comment

How to design GFS/BigTable/MapReduce

c82-java-prog-logo

How to design GFS/BigTable/MapReduce

Foundation of Big Data Age
Interviewer: design search engine
How to read a paper? Find a suitable solution under your scenario instead of recite details
What is the layers of search engine system?
Layers of system
Application 1 / Application 2 / Application 3
Algorithm (MapReduce)
Data model (BigTable)
File system (GFS)
Let’s design from bottom up
Google File System
How to save a file? (how to read or write data from or into file?)
What’s the core idea? (如果删除了什么就不是它自己,就像文件系统,如果去掉数据读写功能,就不是文件系统了)
Key point
1 block = 1024 Byte
Faster to search (why separate file info and data block)
What’s the challenge? (tons of data) => evolve
How to save a big file? (index will explode)
Key point
1 chunk = 64MB = 64 * 1024 = 65,536 blocks (larger chunk)
Advantage
Reduced size of metadata
Reduce traffic
Disadvantage
Waste space for small files
How to save an extra-large file?
a single machine cannot save the file
面试时候面试官有时候不会问很多问题,自己怎么练习一步步引导自己逐渐完善自己的设计?就像上课一样你问的那些关键问题。
简单说就是怎么练习想到那些关键问题?
always ask why?
而且怎么验证提出的问题就是关键问题?最终又怎么验证自己的整体设计是合理的,在面试时候会通过的?有什么样的具体标准吗?
read high scaling blogs, make up logic behind
System Design goal? 如何系统性地设计
Big Table
How to support lookup and range query on a file?
How to save a large table?
Big Table (memory intensive) + GFS (disk intensive)
MapReduce
Divide + Assemble
Input -> Split -> Map -> Shuffle -> Reduce -> Finalize
How to compare MapReduce vs Spark? Is MapReduce less popular than Spark? Do I need to study Spark? How to follow up new tech?
MapReduce is more about idea, Spark is about framework and implementation, faster and more powerful, that’s why more and more people use Spark
Donate $5 to me for a coffee with PayPal and read more professional and interesting technical blog articles about web and mobile development. Feel free to visit my web app, WhizWallet, to apply for credit, store or gift cards, DealsPlus to browse daily deals and store coupons to save money.
Follow me @Yaoli0615 at Twitter to get latest tech updates.
Resources:

Core Java Volume I–Fundamentals (10th Edition) (Core Series)

Core Java, Volume II–Advanced Features (10th Edition) (Core Series)

Test-Driven Java Development

Java Concurrency in Practice

Java: An Introduction to Problem Solving and Programming (7th Edition)

Java 9 for Programmers (Deitel Developer Series)

Java SE8 for the Really Impatient: A Short Course on the Basics (Java Series)

Core Java for the Impatient

Java: The Beginners Guide for every non-programmer which will attend you trough your learning process

Java Deep Learning Essentials

Machine Learning in Java

Learning Reactive Programming With Java 8

Java 9 Programming By Example

Thinking in Java (4th Edition)

The Java EE Architect’s Handbook, Second Edition: How to be a successful application architect for Java EE applications

Java Artificial Intelligence: Made Easy, w/ Java Programming

Posted in CS Research&Application, Uncategorized | Tagged , , , | Leave a comment