How to design Consistent Hashing
Introduction - What is Hashing ?
A hash function is a function that maps one piece of data—typically describing some kind of object, often of arbitrary size—to another piece of data, typically an integer, known as hash code, or simply hash.
Hashing is widely used in algorithms, data structures, and cryptography.
Hash Functions
Hash functions take variable-length input data and produce a fixed-length output value.public int hashCode() { int h = hash; if (h == 0 && value.length > 0) { char val[] = value; for (int i = 0; i < value.length; i++) { h = 31 * h + val[i]; } hash = h; } return h; } public static void main(String[] args) { String anotherTest = "java"; String test = "test"; String oneMoreTest = "dev"; System.out.println(test.hashCode()); // output: 3556498 System.out.println(test.hashCode()); // output: 3556498 System.out.println(anotherTest.hashCode()); // output: 3254818 System.out.println(oneMoreTest.hashCode()); // output: 99349 }
Hashing Algo
Hashing is a technique or process of mapping keys, values into the hash table by using a hash function : Data Structure and Algorithms - Hash Table (tutorialspoint.com)Cryptographic Hash Functions
Cryptographic hash functions are a specialized group of hash functions. They provide an increased level of security. Thus, they are used for cryptography purposes like password verification, data integrity validation, blockchain (cryptocurrencies).
The Rehashing problem : How to balance the load to x servers ?
Why do we need Hashing first ?
But, hashing is a common way to balance the load to servers : serverIndex = hash(key) % N where N is the size of the server pool. (See an example in the page : the hashes for 4 servers - Hash % 4 servers).
But, the problem is when the servers pool is not static anymore ! New servers are added and existing servers could be removed.
If we removed 1 server, modulo changes : Hash % 3 servers. Most of keys are redistributed.
But, when 1 server goes offline, most cache clients will connect to the wrong servers to fetch data. We have an horizontal scalability problem.
The solution : Consistent hashing (CH)
How does CH work ?
CH facilitates the distribution of data across a set of nodes (DB servers) in such a way that minimizes the mapping or reorganization of data when nodes are added or removed.
1- What is CH ? Creating th Hash Key Space.
We need to consider a hash function that generates integer hash values in the following range [0, 2^32-1] and the output of a hash function is: x0, x1, x2,…,xN.
2- Representing the hash space as a hash ring
By connecting both ends, we get a hash ring. So, we can visualize it as a ring.
3- Hash servers : Placing DB servers in the hash ring
We can use the hash function and so, hash the servers on their IP address to map them to different integers.
4- Hash Keys: Determining placement of keys on servers
To find which DB server an incoming key resides on : we assume we have 4 incoming keys (key0, key1, key2, key3) and none of them directly maps the hash value of any 4 servers on our ring. We follow the hand of a clock and insert the key.
5- Adding a server in the ring
When we add a server, we need to add it between server 0 and server 3. We'll need only to remap the keys to the server 4 (new server).
So, we need to remap only k/n keys where k is the number of keys (4) and n is the number of servers (5).
6- Removing a server from the ring
A server might go down and our consistent hashing scheme ensures that it has minimal effect on the number of keys and servers affected. If server 0 goes down, only the keys between server 3 and server 0 will need to be remapped to server 1 : here, we have 2 keys to consider and the rest of the keys are unaffected.
Consistent hashing has successfully solved the horizontal scalability problem by ensuring that every time we scale up or down, we DO NOT have to re-arrange all the keys or touch all the database servers !
Two issues in the basic approach
Impossible to keep the size of partitions on the ring for all servers considering a servers can be added or removed. If a server goes down, the load seen by the server immediately following the failed server will be higher.
It’s more complicated because data does not have uniform distribution in most cases.
Solution to the issues related to CH : Introduction of replicas or virtual nodes
The server 0 could have 2 replicas and the server 1 will one replica.
Now, the server 0 is responsible for 50% keys and the server 1 as well.
Scenarios to consider for HC
You have a cluster of databases and you need to elastically scale them up or down based on traffic load. For example, add more servers during peaks to handle the the extra traffic.
You have a set of cache servers that need to elastically scale up or down based on traffic load.
Benefits of HC
Enables Elastic Scaling of cluster of database/cache servers
Facilitates Replication and partitioning of data across servers
Partitioning of data enables uniform distribution which relieves hot spots