| Tip |
|---|
| The framework proposed in this space (Alex Xu) is not applied exactly as-is to propose this design: see Getting started - a framework to propose... |
Introduction
A key-value store (KVS), also referred to as a key-value database (KVD), is a non-relational database.
A unique identifier is stored as a key together with its associated value.
The key must be unique, and the value is accessed through its key.
Part 1 - Understand the problem and establish design scope
| Requirement | Decision |
|---|---|
| Size of a key-value pair | Small: less than 10 KB |
| Ability to store big data | Yes |
| High availability: the system responds quickly, even during failures | Yes |
| High scalability: the system can be scaled to support a large data set | Yes |
| Automatic scaling: servers are added/removed automatically based on traffic | Yes |
| Tunable consistency | Yes |
| Low latency | Yes |
Part 2 - KVS in Single Server
...
Even with optimisations (data compression, keeping frequently used data in memory and the rest on disk), a single server reaches its capacity quickly: a distributed key-value store is required for big data.
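For context, a single-server key-value store is essentially an in-memory hash table. A minimal sketch (the class and method names are illustrative, not from the chapter):

```python
# Minimal single-server key-value store: just an in-memory hash table.
# Illustrative sketch; a real version would add compression and spill
# cold entries to disk once memory fills up.
class SingleServerKVStore:
    def __init__(self):
        self._table = {}  # key -> value, everything lives in memory

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table.get(key)

store = SingleServerKVStore()
store.put("user:42", b"Alice")
print(store.get("user:42"))  # b'Alice'
```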
Part 3 - KVS in Distributed Environment
...
Real-world distributed systems
Partitions cannot be avoided, so we have to choose between consistency and availability!
If consistency is chosen over availability (strict consistency requirements), the system may become unavailable during a partition.
If availability is chosen over consistency, the system keeps accepting reads (and writes) even though it may return stale data.
How to store & fetch data? (data partition)
We can use the consistent hashing technique. This technique gives two advantages (see the sketch after this list):
Automatic scaling: servers can be added and removed automatically depending on the load.
Heterogeneity: the number of virtual nodes (or replicas) for a server is proportional to the server's capacity.
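A minimal sketch of a consistent-hashing ring with virtual nodes; the class name, the MD5-based hash, and the number of virtual nodes per server are illustrative choices:

```python
import bisect
import hashlib

# Consistent-hashing ring with virtual nodes (illustrative sketch).
class HashRing:
    def __init__(self, vnodes_per_server=3):
        self.vnodes_per_server = vnodes_per_server
        self._ring = []    # sorted list of (hash, server) pairs
        self._hashes = []  # parallel sorted list of hashes, for bisect

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_server(self, server, weight=1):
        # Heterogeneity: a more powerful server gets proportionally more
        # virtual nodes, so it owns a larger share of the key space.
        for i in range(self.vnodes_per_server * weight):
            h = self._hash(f"{server}#vn{i}")
            index = bisect.bisect(self._hashes, h)
            self._hashes.insert(index, h)
            self._ring.insert(index, (h, server))

    def get_server(self, key):
        # Walk clockwise from the key's position to the first virtual node.
        h = self._hash(key)
        index = bisect.bisect(self._hashes, h) % len(self._ring)
        return self._ring[index][1]

ring = HashRing()
ring.add_server("s0")
ring.add_server("s1", weight=2)  # twice the capacity -> twice the vnodes
print(ring.get_server("user:42"))
```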
How to replicate data? (data replication)
To achieve high availability and reliability, data must be replicated asynchronously over N servers, where N is a configurable parameter.
Replication preserves the data held by a node. To replicate a key-value pair, we walk clockwise from the key's position on the hash ring and save copies on the next 2 (or 3, depending on the replication factor, i.e. the number of copies) distinct server nodes.
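A rough sketch of replica placement under these assumptions: the ring is a sorted list of (hash, server) pairs, and we walk clockwise collecting the first N distinct physical servers (helper names and the toy ring are illustrative):

```python
import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def replica_servers(key, ring, n):
    """Walk the ring clockwise from the key's position and collect the
    first n *distinct* physical servers; extra virtual nodes of a server
    already chosen are skipped. `ring` is a sorted list of (hash, server)."""
    hashes = [h for h, _ in ring]
    start = bisect.bisect(hashes, _hash(key))
    replicas = []
    for i in range(len(ring)):
        server = ring[(start + i) % len(ring)][1]
        if server not in replicas:
            replicas.append(server)
        if len(replicas) == n:
            break
    return replicas

# Toy ring: 3 servers with 2 virtual nodes each (positions are illustrative).
servers = ["s0", "s1", "s2"]
ring = sorted((_hash(f"{s}#vn{i}"), s) for s in servers for i in range(2))
print(replica_servers("user:42", ring, n=3))  # e.g. ['s1', 's0', 's2']
```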
How to handle consistency?
Since data is replicated across several nodes, it must be synchronized. The issue is that this configuration is a typical trade-off between latency and consistency (a small example follows the table below).
...
| Configuration of W (write quorum), R (read quorum) and N (number of replicas/nodes) | Effect |
|---|---|
| W or R > 1 | Better consistency, but slower: the coordinator must wait for the response from the slowest replica/node in the quorum. |
| W + R > N | Strong consistency is guaranteed, because at least one overlapping node has the latest data. |
| R = 1 and W = N | Optimized for fast reads. |
| W = 1 and R = N | Optimized for fast writes. |
| W + R <= N | Strong consistency is NOT guaranteed. |
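The quorum rule above can be expressed directly in code; a small illustrative helper that classifies a (W, R, N) configuration:

```python
def describe_quorum(w, r, n):
    """Classify a (W, R, N) configuration as in the table above."""
    if w + r > n:
        return "strong consistency (read and write sets always overlap)"
    return "strong consistency NOT guaranteed (stale reads are possible)"

# N = 3 replicas in all examples below.
print(describe_quorum(w=2, r=2, n=3))  # overlap of at least one node
print(describe_quorum(w=1, r=1, n=3))  # fast, but possibly stale
print(describe_quorum(w=3, r=1, n=3))  # W = N, R = 1: optimized for fast reads
```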
...
Versioning is the solution for resolving inconsistency. Versioning means treating each data modification as a new, immutable version of the data.
A conflict arises when two servers hold the same value and changes are performed on both servers at the same time (conflicting versions).
...
Example of incrementing a vector clock when there is a conflicting value:
if D3([Sx, 2], [Sy, 1]) and D4([Sx, 2], [Sz, 1]) are conflicting versions,
then D5([Sx, 3], [Sy, 1], [Sz, 1]) → value reconciled and written by Sx (its counter is incremented).
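A minimal vector-clock sketch reproducing this example, with clocks represented as server → counter maps (function names are illustrative):

```python
# Vector clocks as server -> counter maps (illustrative sketch).
def happens_before(a, b):
    """True if version a is an ancestor of version b (no conflict)."""
    return all(a[s] <= b.get(s, 0) for s in a)

def is_conflict(a, b):
    return not happens_before(a, b) and not happens_before(b, a)

def reconcile(a, b, writer):
    """Merge two conflicting clocks and bump the counter of the server
    that writes the reconciled value."""
    merged = {s: max(a.get(s, 0), b.get(s, 0)) for s in a.keys() | b.keys()}
    merged[writer] = merged.get(writer, 0) + 1
    return merged

d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(is_conflict(d3, d4))             # True: neither version descends from the other
print(reconcile(d3, d4, writer="Sx"))  # counters Sx=3, Sy=1, Sz=1, i.e. D5
```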
How to handle failures?
Step 1 - Detect the failures with Gossip Protocol method
Each node maintains a node membership list (containing member IDs and heartbeat counters).
Each node periodically increments its own heartbeat counter.
Each node periodically sends heartbeats to a set of random nodes, which in turn propagate them to another set of nodes.
Once a node receives heartbeats, its membership list is updated with the latest info.
If a heartbeat counter has not increased for more than a predefined period, the member is considered offline (see the sketch below).
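A minimal sketch of such a membership list; the field layout and the 10-second offline threshold are assumptions for illustration:

```python
import time

# Gossip-style membership list: node id -> (heartbeat counter, last update time).
# Field names and the 10-second timeout are illustrative.
class MembershipList:
    OFFLINE_AFTER_SECONDS = 10

    def __init__(self):
        self._members = {}

    def merge(self, gossip):
        """Merge a heartbeat table received from a peer: keep the highest
        counter seen for each member and note when it last increased."""
        now = time.time()
        for node, counter in gossip.items():
            current = self._members.get(node, (-1, now))
            if counter > current[0]:
                self._members[node] = (counter, now)

    def offline_members(self):
        """Members whose heartbeat counter has not increased recently."""
        now = time.time()
        return [n for n, (_, seen) in self._members.items()
                if now - seen > self.OFFLINE_AFTER_SECONDS]

m = MembershipList()
m.merge({"s0": 5, "s1": 3})  # heartbeats gossiped by a peer
m.merge({"s0": 6})           # s0's counter keeps increasing, s1's does not
print(m.offline_members())   # [] now; ['s1'] once 10 s pass without updates
```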
...
Step 2 - Handle the failures
What if a node is temporarily unavailable? Implement the SLOPPY QUORUM method: instead of enforcing a strict quorum, the first W healthy servers for writes and the first R healthy servers for reads are chosen on the hash ring, ignoring offline servers.
...
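A rough sketch of the sloppy-quorum idea under these assumptions: offline servers are skipped on the key's preference list, and a healthy stand-in temporarily keeps a hint for the offline server (hinted handoff); names are illustrative:

```python
def choose_healthy(preference_list, offline, count):
    """Sloppy quorum: skip offline servers and take the first `count`
    healthy ones from the key's preference list on the hash ring."""
    healthy = [s for s in preference_list if s not in offline]
    return healthy[:count]

# Preference list for a key (first servers clockwise on the ring).
preference = ["s1", "s2", "s3", "s4", "s5"]
offline = {"s2"}
print(choose_healthy(preference, offline, count=3))  # ['s1', 's3', 's4']
# s4 keeps a "hint" for s2 and hands the write back once s2 comes online.
```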
What if a node is permanently unavailable? Implement the ANTI-ENTROPY PROTOCOL to keep replicas in sync.
It involves comparing each piece of data on the replicas and updating each replica to the newest version. A Merkle tree is used to detect inconsistencies and to minimize the amount of data transferred.
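A minimal Merkle-tree sketch over key buckets: two replicas compare bucket hashes and only sync the buckets that differ, so the amount of data transferred is proportional to the differences rather than to the whole data set. The bucketing scheme and hash choice below are illustrative:

```python
import hashlib

def bucket_hashes(data, num_buckets=4):
    """Hash of each key bucket: the leaves of a Merkle tree."""
    buckets = [[] for _ in range(num_buckets)]
    for key in sorted(data):
        # Illustrative bucketing; real systems bucket by key range.
        b = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
        buckets[b].append(f"{key}={data[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

def root_hash(leaves):
    """Collapse leaf hashes pairwise up to a single root hash."""
    level = leaves
    while len(level) > 1:
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = {"k1": "v1", "k2": "v2-stale", "k3": "v3"}
la, lb = bucket_hashes(replica_a), bucket_hashes(replica_b)
print(root_hash(la) == root_hash(lb))                 # False: the replicas differ somewhere
print([i for i in range(len(la)) if la[i] != lb[i]])  # only these buckets need syncing
```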
...
What tasks are performed by the nodes?
Write path
...
The write request is persisted in a commit log file.
Data is saved in the memory cache.
When the memory cache is full or reaches a predefined threshold, data is flushed to an SSTable (sorted-string table) on disk (see the sketch below).
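A rough sketch of this write path; the class name, the in-memory stand-ins for the log and disk, and the tiny flush threshold are assumptions for illustration:

```python
# Write path sketch: commit log -> in-memory table -> SSTable flush.
class WritePath:
    MEMTABLE_LIMIT = 3  # flush after this many entries (tiny, for the example)

    def __init__(self):
        self.commit_log = []  # stands in for an append-only log file
        self.memtable = {}    # in-memory cache of recent writes
        self.sstables = []    # immutable, sorted runs on "disk"

    def put(self, key, value):
        self.commit_log.append(f"{key}={value}")  # 1. persist to the commit log
        self.memtable[key] = value                # 2. save in the memory cache
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()                         # 3. flush to an SSTable when full

    def _flush(self):
        # SSTable = sorted-string table: keys are written out in sorted order.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()

wp = WritePath()
for i in range(4):
    wp.put(f"k{i}", f"v{i}")
print(len(wp.sstables), wp.memtable)  # 1 flushed SSTable, 'k3' still in memory
```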
Read path
...
The system checks whether the data is in memory; if so, it is returned to the client, otherwise it continues to the next step.
If the data is not in memory, the system checks the Bloom filter.
The Bloom filter is used to figure out which SSTables might contain the key.
The result of the data set is fetched from those SSTables and returned to the client (sketched below).
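A rough sketch of this read path with a toy Bloom filter (real filters use several independent hash functions over a sized bit array; everything named here is illustrative):

```python
import hashlib

def _positions(key, size, hashes=3):
    """Bit positions for a key, derived from salted hashes (illustrative)."""
    return [int(hashlib.md5(f"{i}:{key}".encode()).hexdigest(), 16) % size
            for i in range(hashes)]

class BloomFilter:
    def __init__(self, size=64):
        self.size = size
        self.bits = [False] * size

    def add(self, key):
        for p in _positions(key, self.size):
            self.bits[p] = True

    def might_contain(self, key):
        # False means definitely absent; True means the SSTable *might* have it.
        return all(self.bits[p] for p in _positions(key, self.size))

def read(key, memtable, sstables):
    # 1. Check the in-memory cache first.
    if key in memtable:
        return memtable[key]
    # 2. Otherwise, only look at SSTables whose Bloom filter says "maybe".
    for bloom, table in sstables:
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None  # 3. Not found anywhere; this result goes back to the client.

memtable = {"k9": "v9"}
bloom = BloomFilter()
bloom.add("k1")
sstables = [(bloom, {"k1": "v1"})]
print(read("k1", memtable, sstables))  # 'v1', found via the Bloom filter
```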
Part 4 - Summary
| Technique | Goal / Problem |
|---|---|
| Consistent hashing | Ability to store big data; dataset partition; incremental scalability; heterogeneity |
| Data replication | High availability for reads |
| Versioning and conflict resolution with vector clocks | High availability for writes |
| Quorum consensus | Tunable consistency |
| Sloppy quorum | Handle temporary failures |
| Merkle tree | Handle permanent failures |
| Cross-datacenter replication | Handle data center outage |