Part 3 - KVS in Distributed Environment
data:image/s3,"s3://crabby-images/36dce/36dce124ac4ab36a0df8b1688c22099abf4b4050" alt=""
Real-world distributed systems
Partitions cannot be avoided. We have to choose between consistency and availability !
How to store & fetch data ? (data partition)
We can use the Consistent Hashing technic. Using this technic, gives 2 advantages :
Automatic scaling: servers could be added and removed automatically depending on the load.
Heterogeneity: number of virtual nodes (or replicas) for a server is proportional to the server capacity.
How to replicate data ? (data replication)
To achieve HA and reliability, data must be replicated asynchronously over N servers where N is a configurable parameter.
The replication is useful for preserving the data in a node. For replicating a key-value, we'll fetch next 2 (or 3, depending on replication factor / number of copies) distinct server nodes from the hash ring and save the copies into these nodes.
How to handle consistency ?
Since data is replicated, it must be synchronized. Issue is configuration is a typical tradeoff between latency and consistency.
data:image/s3,"s3://crabby-images/67017/67017c66e6dd9c52c1342f43433f9357969edba0" alt=""
CONFIGURATION OF W (WRITE OPERATION), R (READ OPERATION) AND N SERVERS (REPLICAS OR NODES) |
---|
if W or R > 1 | better consistency; but slower because coordinator must wait for response from slowest replica or node. |
if W + R > N | strong consistency is guaranteed because one overlapping node has latest data to ensure consistency. |
if R = 1 and W = N | optimized for fast read. |
if W = 1 and R = N | optimized for fast write. |
if W + R <= N | strong consistency is NOT guaranteed |
See the consistency models - page References & Glossary for Key-Value store.
Versioning is the solution to resolve inconsistency. Versioning means treating each data modification as a new immutable version of data.
2 servers have the same value, but changes are performed at the same time in the both servers (conflicting values).
data:image/s3,"s3://crabby-images/a40f3/a40f3321adebdf36a49bbdd3721aa6014f3fca83" alt=""
Example of increment when there is a conflicting value, increment with a vector block.
if D3([Sx,2],[Sy,1]) et D4([Sx,2],[Sz,1]), then
then, D5([Sx,3],[Sy,1],[Sz,1]) → Value reconciled and written by Sx.
How to handle failures ?
Step 1 - Detect the failures with Gossip Protocol method
Each node maintains a node membership list (containing member ID & heartbeat counters).
Each node periodically increments its counter.
Each node periodically sends heartbeats to a set of random nodes which in turn propagate to another
Once node receives heartbeats, membership list is updated to the latest info.
If heartbeat has not increased for more than predefined periods, member is considered as offline.
data:image/s3,"s3://crabby-images/d5cad/d5cad656a49f548799898a173f262df99050b329" alt=""
Step 2 - Handle the failures
What if a node is temporary unavailable ? Implement the SLOPPY QUORUM method.
data:image/s3,"s3://crabby-images/3a780/3a7803580bf6f26b16f55d37c37095a316a23c3f" alt=""
What if a node is permanently unavailable ? Implement the ANTI-ENTROPY PROTOCOL method to keep replicas in sync.
It involves comparing each piece of data on nodes and updating each node to newest version. A Merkle tree is used for inconsistency detection and minimize the amount of data transferred.
data:image/s3,"s3://crabby-images/70a74/70a74b5bc7946621c75a1b41beb63692b8b9c79e" alt=""
Write path
data:image/s3,"s3://crabby-images/4bacf/4bacf4265a3be9bb8559ac022e3c811283ae91a8" alt=""
The write request is persisted on a commit log file.
Data is saved in the memory cache.
When the memory cache is full or reaches o predefined threshold, data is flushed to SSTable on disk (a sorted-string table).
Read path
data:image/s3,"s3://crabby-images/5c052/5c0528761f9ca45c7e098058d607f69f0e25e0ec" alt=""
The system checks if data is in memory. If not,…then step 2.
If data is not in memory, the system checks the bloom filter.
The bloom filter is used to figure out which SSTable might contain the key.
The result of the data set is returned to the client.