How to design a key-value store

Part 1 - Understand the problem and establish design scope

Size of a KV pair	Small : less than 10 KB
Ability to store big data ?	Yes.
High Availability: system responds quickly even during failures ?	Yes.
High Availability: system can be scaled to support large data set ?	Yes.
Automatic scaling: adding / deleting servers based on traffic ?	Yes.
Tunable consistency ?	Yes.
Low latency ?	Yes.

Part 3 - KVS in Distributed Environment

Real-world distributed systems

Partitions cannot be avoided. We have to choose between consistency and availability !

If consistency over availability because high consistent requirements = system could be unavailable.
if availability over consistency = system keeps accepting reads even though it returns stale data.

How to store & fetch data ? (data partition)

We can use the Consistent Hashing technic. Using this technic, gives 2 advantages :

Automatic scaling: servers could be added and removed automatically depending on the load.
Heterogeneity: number of virtual nodes (or replicas) for a server is proportional to the server capacity.

How to replicate data ? (data replication)

To achieve HA and reliability, data must be replicated asynchronously over N servers where N is a configurable parameter.

The replication is useful for preserving the data in a node. For replicating a key-value, we'll fetch next 2 (or 3, depending on replication factor / number of copies) distinct server nodes from the hash ring and save the copies into these nodes.

How to handle consistency ?

Since data is replicated, it must be synchronized. Issue is configuration is a typical tradeoff between latency and consistency.

CONFIGURATION OF W (WRITE OPERATION), R (READ OPERATION) AND N SERVERS (REPLICAS OR NODES)
if W or R > 1	better consistency; but slower because coordinator must wait for response from slowest replica or node.
if W + R > N	strong consistency is guaranteed because one overlapping node has latest data to ensure consistency.
if R = 1 and W = N	optimized for fast read.
if W = 1 and R = N	optimized for fast write.
if W + R <= N	strong consistency is NOT guaranteed

See the consistency models - page References & Glossary for Key-Value store.

Versioning is the solution to resolve inconsistency. Versioning means treating each data modification as a new immutable version of data.

2 servers have the same value, but changes are performed at the same time in the both servers (conflicting values).

Example of increment when there is a conflicting value, increment with a vector block.

if D3([Sx,2],[Sy,1]) et D4([Sx,2],[Sz,1]), then

then, D5([Sx,3],[Sy,1],[Sz,1]) → Value reconciled and written by Sx.

How to handle failures ?

Step 1 - Detect the failures with Gossip Protocol method

Each node maintains a node membership list (containing member ID & heartbeat counters).
Each node periodically increments its counter.
Each node periodically sends heartbeats to a set of random nodes which in turn propagate to another
Once node receives heartbeats, membership list is updated to the latest info.
If heartbeat has not increased for more than predefined periods, member is considered as offline.

Step 2 - Handle the failures

What if a node is temporary unavailable ? Implement the SLOPPY QUORUM method.

What if a node is permanently unavailable ? Implement the ANTI-ENTROPY PROTOCOL method to keep replicas in sync.

It involves comparing each piece of data on nodes and updating each node to newest version. A Merkle tree is used for inconsistency detection and minimize the amount of data transferred.

What tasks are performed by the nodes ?

Write path

The write request is persisted on a commit log file.
Data is saved in the memory cache.
When the memory cache is full or reaches o predefined threshold, data is flushed to SSTable on disk (a sorted-string table).

Read path

The system checks if data is in memory. If not,…then step 2.
If data is not in memory, the system checks the bloom filter.
The bloom filter is used to figure out which SSTable might contain the key.
The result of the data set is returned to the client.

Part 4 - Summary

Technique	Goal/Problem
Consistent Hashing	Ability to store big data
	Dataset partition
	Incremental scalability
	Heterogeneity
Data replication	High Availability READS
Versioning and conflict resolution with vector clock	High Availability WRITES
Quorum consensus	Tunable consistency
Sloppy quorum	Handle temporary failures
Merkle tree	Handle permanent failures
Cross-datacenter replication	Handle data center outage

Introduction