
The framework proposed in this space (Alex Xu) is not applied exactly to this design: see Getting started - a framework to propose...

Introduction

A key-value store (KVS), also referred to as a key-value database (KVD), is a non-relational database.

  • A unique identifier is stored as a key together with its associated value.

  • Keys must be unique, and a value is accessed through its key.


Part 1 - Understand the problem and establish design scope

Size of a KV pair?

Small: less than 10 KB.

Ability to store big data?

Yes.

High availability: the system responds quickly, even during failures?

Yes.

High scalability: the system can be scaled to support a large data set?

Yes.

Automatic scaling: adding / deleting servers based on traffic?

Yes.

Tunable consistency?

Yes.

Low latency?

Yes.

Part 2 - KVS on a Single Server

Even with optimisations (data compression, keeping frequently used data in memory and the rest on disk), a single server fills up quickly: a distributed KV store is required for big data.
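Conceptually, a single-server KVS is just a hash table that keeps everything in memory. A minimal sketch (the class name is illustrative, not from the source):

    class SingleServerKVS:
        """In-memory key-value store: a thin wrapper around a hash table."""

        def __init__(self) -> None:
            self._store: dict[str, bytes] = {}

        def put(self, key: str, value: bytes) -> None:
            # Overwrites any existing value for the key.
            self._store[key] = value

        def get(self, key: str) -> bytes | None:
            # Returns None when the key is absent.
            return self._store.get(key)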

Part 3 - KVS in a Distributed Environment

Real-world distributed systems

Partitions cannot be avoided, so we have to choose between consistency and availability!

  • If consistency is chosen over availability (strict consistency requirements), the system may become unavailable during a partition.

  • If availability is chosen over consistency, the system keeps accepting reads even though it may return stale data.

How to store & fetch data ? (data partition)

We can use the consistent hashing technique. This technique gives two advantages (a sketch follows the list):

  1. Automatic scaling: servers could be added and removed automatically depending on the load.

  2. Heterogeneity: number of virtual nodes (or replicas) for a server is proportional to the server capacity.
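A minimal sketch of a consistent-hash ring with virtual nodes, assuming an MD5-based hash; the names (HashRing, add_server) are illustrative, not from the source:

    import bisect
    import hashlib

    def ring_hash(key: str) -> int:
        # Position on the ring: an integer derived from MD5.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, vnodes_per_unit: int = 100) -> None:
            self._keys: list[int] = []              # sorted vnode positions
            self._ring: list[tuple[int, str]] = []  # (position, server)
            self.vnodes_per_unit = vnodes_per_unit

        def add_server(self, server: str, capacity: int = 1) -> None:
            # Heterogeneity: a server's virtual-node count is
            # proportional to its capacity.
            for i in range(capacity * self.vnodes_per_unit):
                pos = ring_hash(f"{server}#vnode{i}")
                idx = bisect.bisect(self._keys, pos)
                self._keys.insert(idx, pos)
                self._ring.insert(idx, (pos, server))

        def remove_server(self, server: str) -> None:
            # Only the keys owned by this server's vnodes move elsewhere.
            kept = [(p, s) for p, s in self._ring if s != server]
            self._ring = kept
            self._keys = [p for p, _ in kept]

        def get_server(self, key: str) -> str:
            # The first virtual node clockwise from the key's position.
            idx = bisect.bisect(self._keys, ring_hash(key)) % len(self._keys)
            return self._ring[idx][1]

Adding or removing a server only remaps the keys owned by its virtual nodes, which is what makes automatic scaling cheap.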

How to replicate data ? (data replication)

To achieve HA and reliability, data must be replicated asynchronously over N servers, where N is a configurable parameter.

Replication preserves the data held by a node. To replicate a key-value pair, we fetch the next 2 (or 3, depending on the replication factor / number of copies) distinct server nodes on the hash ring and save the copies onto these nodes.
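A sketch of replica placement on the same ring; get_replicas is an illustrative extension of the HashRing above, not from the source:

    def get_replicas(ring: HashRing, key: str, n: int = 3) -> list[str]:
        # Walk clockwise from the key's position and keep the first n
        # distinct physical servers; virtual nodes of a server already
        # chosen are skipped so the copies land on different machines.
        start = bisect.bisect(ring._keys, ring_hash(key))
        replicas: list[str] = []
        for i in range(len(ring._ring)):
            _, server = ring._ring[(start + i) % len(ring._ring)]
            if server not in replicas:
                replicas.append(server)
                if len(replicas) == n:
                    break
        return replicas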

How to handle consistency ?

Since data is replicated, it must be synchronized. The issue is that the configuration of W, R and N is a typical tradeoff between latency and consistency.

CONFIGURATION OF W (WRITE QUORUM SIZE), R (READ QUORUM SIZE) AND N (NUMBER OF REPLICAS OR NODES)

If W or R > 1:

better consistency, but slower, because the coordinator must wait for the response of the slowest replica or node.

If W + R > N:

strong consistency is guaranteed, because at least one overlapping node has the latest data.

If R = 1 and W = N:

optimized for a fast read.

If W = 1 and R = N:

optimized for a fast write.

If W + R <= N:

strong consistency is NOT guaranteed.
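The overlap rule is plain arithmetic. A quick check of the configurations above (the function name is illustrative; the logic uses nothing beyond the definitions of N, W and R):

    def overlap_guaranteed(n: int, w: int, r: int) -> bool:
        # When w + r > n, every read quorum shares at least one node with
        # every write quorum, so the read sees the latest written value.
        return w + r > n

    assert overlap_guaranteed(n=3, w=2, r=2)      # strong consistency
    assert overlap_guaranteed(n=3, w=3, r=1)      # fast reads (R = 1)
    assert overlap_guaranteed(n=3, w=1, r=3)      # fast writes (W = 1)
    assert not overlap_guaranteed(n=3, w=1, r=1)  # stale reads possible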

See the consistency models on the page References & Glossary for Key-Value store.

Versioning is the solution for resolving inconsistency. Versioning means treating each data modification as a new, immutable version of the data.

A conflict arises when two servers hold the same value and changes are then performed at the same time on both servers (conflicting values).

Example of a conflicting increment, resolved with a vector clock:

If D3([Sx,2],[Sy,1]) and D4([Sx,2],[Sz,1]) coexist, neither descends from the other, so they conflict;

then D5([Sx,3],[Sy,1],[Sz,1]) → value reconciled and written by Sx.
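A vector-clock sketch for the example above, representing each clock as a dict of [server, version] pairs; the function names are illustrative, not from the source:

    def descends(a: dict, b: dict) -> bool:
        # a descends from b if every counter in b is <= the matching one in a.
        return all(a.get(server, 0) >= count for server, count in b.items())

    def conflict(a: dict, b: dict) -> bool:
        # Neither version descends from the other: a true conflict that
        # must be reconciled by the client.
        return not descends(a, b) and not descends(b, a)

    d3 = {"Sx": 2, "Sy": 1}
    d4 = {"Sx": 2, "Sz": 1}
    assert conflict(d3, d4)

    # After the client reconciles and Sx writes the merged version:
    d5 = {"Sx": 3, "Sy": 1, "Sz": 1}
    assert descends(d5, d3) and descends(d5, d4)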

How to handle failures ?

Step 1 - Detect failures with the Gossip Protocol method

  • Each node maintains a node membership list (containing member IDs & heartbeat counters).

  • Each node periodically increments its own counter.

  • Each node periodically sends heartbeats to a set of random nodes, which in turn propagate them to another set of nodes.

  • Once nodes receive heartbeats, the membership list is updated with the latest info.

  • If a member's heartbeat counter has not increased for more than a predefined period, the member is considered offline.
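An in-process sketch of the gossip mechanics described above; the names (Member, gossip, merge) are illustrative, not from the source:

    import random
    import time

    class Member:
        def __init__(self, node_id: str) -> None:
            self.node_id = node_id
            # membership list: node_id -> (heartbeat counter, last update time)
            self.members = {node_id: (0, time.monotonic())}

        def tick(self) -> None:
            # Periodically increment our own heartbeat counter.
            count, _ = self.members[self.node_id]
            self.members[self.node_id] = (count + 1, time.monotonic())

        def gossip(self, peers: list, fanout: int = 2) -> None:
            # Send our membership list to a few random peers.
            for peer in random.sample(peers, min(fanout, len(peers))):
                peer.merge(self.members)

        def merge(self, incoming: dict) -> None:
            # Keep the highest heartbeat counter seen for each member.
            for node_id, (count, _) in incoming.items():
                local = self.members.get(node_id, (-1, 0.0))
                if count > local[0]:
                    self.members[node_id] = (count, time.monotonic())

        def offline(self, node_id: str, timeout: float = 5.0) -> bool:
            # A member whose counter has not advanced within the timeout
            # is considered offline.
            _, last_seen = self.members.get(node_id, (0, 0.0))
            return time.monotonic() - last_seen > timeout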

Step 2 - Handle the failures

What if a node is temporarily unavailable? Implement the SLOPPY QUORUM method, sketched below.
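A sloppy-quorum sketch, assuming the HashRing from Part 3 and an illustrative healthy() check (neither is from the source): instead of insisting on the first N servers of the ring, the coordinator picks the first N healthy ones, so reads and writes keep succeeding while a node is down.

    def sloppy_replicas(ring: HashRing, key: str, n: int, healthy) -> list[str]:
        # Like get_replicas, but offline servers are skipped; writes that
        # land on a stand-in server are handed back once the real owner
        # recovers (hinted handoff).
        start = bisect.bisect(ring._keys, ring_hash(key))
        chosen: list[str] = []
        for i in range(len(ring._ring)):
            _, server = ring._ring[(start + i) % len(ring._ring)]
            if server not in chosen and healthy(server):
                chosen.append(server)
                if len(chosen) == n:
                    break
        return chosen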

What if a node is permanently unavailable? Implement the ANTI-ENTROPY PROTOCOL method to keep replicas in sync.

It involves comparing each piece of data across replicas and updating each replica to the newest version. A Merkle tree is used to detect inconsistencies and to minimise the amount of data transferred.
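A Merkle-tree comparison sketch, assuming the key space is split into buckets that are hashed bottom-up; the names are illustrative, not from the source:

    import hashlib

    def bucket_hash(bucket: dict) -> str:
        # Hash a bucket's sorted key-value pairs.
        data = "".join(f"{k}={v}" for k, v in sorted(bucket.items()))
        return hashlib.sha256(data.encode()).hexdigest()

    def merkle_root(bucket_hashes: list[str]) -> str:
        # Build parent hashes pairwise until a single root remains.
        level = list(bucket_hashes)
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])   # duplicate the odd one out
            level = [hashlib.sha256((a + b).encode()).hexdigest()
                     for a, b in zip(level[::2], level[1::2])]
        return level[0]

    # Two replicas compare roots first; only if the roots differ do they
    # descend the tree to find the buckets (and keys) that need syncing.
    replica_a = [{"k1": "v1"}, {"k2": "v2"}]
    replica_b = [{"k1": "v1"}, {"k2": "stale"}]
    ha = [bucket_hash(b) for b in replica_a]
    hb = [bucket_hash(b) for b in replica_b]
    assert merkle_root(ha) != merkle_root(hb)
    diff = [i for i, (x, y) in enumerate(zip(ha, hb)) if x != y]
    assert diff == [1]   # only bucket 1 must be transferred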

What tasks are performed by the nodes?

Write path

Read path
