
What is Sharding or Data Partitioning?

Sharding (also known as Data Partitioning) is the process of splitting a large dataset into many small partitions which are placed on different machines. Each partition is known as a "shard".

Each shard has the same database schema as the original database. Most data is distributed such that each row appears in exactly one shard. The combined data from all shards is the same as the data from the original database.


What scalability issues are solved by Sharding?

  • Performance degradation with a single DB server architecture.

  • Queries and updates become slower, and network bandwidth starts to saturate.

  • Disk space is running out.

What are some common Sharding Schemes?

Strategy 1 - Horizontal or Range Based Sharding

Data is split based on value ranges that are inherent in each entity. For example, customers whose last names start with A-I, J-Q, and R-Z could be stored on three different shards.

  • Disadvantage of this scheme is that the last names of the customers may not be evenly distributed, so some shards end up holding far more data than others.

  • Benefit of this approach is that it's the simplest sharding scheme available. Each shard also has the same schema as the original database.

Note: it works well for relatively non-static data.
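
To make the scheme concrete, here is a minimal Python sketch of range-based routing, assuming a hypothetical mapping from the first letter of a customer's last name to a shard; the letter ranges, shard names and shard_for_last_name helper are illustrative only.

```python
# Range-based (horizontal) sharding: route a row by the value range its
# shard key falls into. Ranges and shard names below are assumptions.
RANGE_SHARDS = [
    # (inclusive lower bound, exclusive upper bound, shard name)
    ("A", "J", "customers-shard-1"),
    ("J", "R", "customers-shard-2"),
    ("R", None, "customers-shard-3"),   # None means open-ended upper bound
]

def shard_for_last_name(last_name: str) -> str:
    """Return the shard that should hold this customer's row."""
    key = last_name[:1].upper()
    for lower, upper, shard in RANGE_SHARDS:
        if key >= lower and (upper is None or key < upper):
            return shard
    raise ValueError(f"No shard configured for last name {last_name!r}")

print(shard_for_last_name("Brown"))   # customers-shard-1
print(shard_for_last_name("Smith"))   # customers-shard-3
```

The routing table also makes the uneven-distribution problem visible: if most last names fall into one range, that shard grows much faster than the others.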

Strategy 2 - Vertical Sharding

Different features of an entity will be placed in different shards on different machines.

  • Disadvantages: 1) Depending on your system, the application layer might need to combine data from multiple shards to answer a single query (user profile, connections, articles shards…), which increases complexity and development effort. 2) If your site/system experiences additional growth, it may become necessary to further shard a feature-specific database across multiple servers.

  • Advantage: the critical part of your data (for example, user profiles) can be handled differently from the less critical part (for example, blog posts), and different replication and consistency models can be built around each.
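
As a rough illustration, the following Python sketch routes each feature of a user's data to its own database; the feature names and connection strings are hypothetical.

```python
# Vertical sharding: each feature/table lives in its own database, so the
# critical data can get stronger replication/consistency than the rest.
# The feature names and connection strings are illustrative assumptions.
FEATURE_SHARDS = {
    "user_profiles": "profiles-db.internal",     # critical: strong consistency
    "connections":   "connections-db.internal",
    "articles":      "articles-db.internal",     # less critical: eventual consistency
}

def database_for_feature(feature: str) -> str:
    """Return the database that owns a given feature."""
    return FEATURE_SHARDS[feature]

# Rendering a single user page now touches several databases, which is the
# added complexity called out in the disadvantages above.
profile_db = database_for_feature("user_profiles")
articles_db = database_for_feature("articles")
```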

Strategy 3 - Key or Hash based Sharding

An entity has a value (e.g. the IP address of a client application) which can be used as input to a hash function to generate a hash value. This hash value determines which database server (shard) to use.

  • Disadvantage: the main drawback of this method is that elastic load balancing (dynamically adding/removing database servers) becomes very difficult and expensive. This problem can be addressed with consistent hashing.
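
A minimal sketch of the idea in Python, assuming the client IP address is the shard key as in the example above; the shard list and the use of MD5 are illustrative assumptions.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2", "db-3"]

def shard_for_key(key: str) -> str:
    """Hash the shard key and take it modulo the number of servers."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for_key("203.0.113.42"))

# The drawback in code: adding or removing a server changes len(SHARDS),
# so almost every key suddenly maps to a different shard and most of the
# data has to be moved.
```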

Strategy 4 - Directory based Sharding

Involves placing a lookup service in front of the sharded databases. The lookup service knows the current partitioning scheme and keeps a map of each entity and which database shard it is stored on. The lookup service is usually implemented as a webservice.

  • Advantage: it enables us to solve the elastic scaling problem mentioned above for key/hash based sharding.

Here's how: in the previous example, we had 4 database servers and a hash function that performed a modulo 4 operation on the application ids. Now, if we wanted to add 6 more database servers without incurring any downtime, we would need to do the following steps, sketched in code further below:

  1. Keep the modulo 4 hash function in the lookup service.

  2. Determine the data placement based on the new hash function - modulo 10.

  3. Write a script to copy all the data based on #2 into the six new shards and possibly on the 4 existing shards. Note that it does not delete any existing data on the 4 existing shards.

  4. Once the copy is complete, change the hash function to modulo 10 in the lookup service.

  5. Run a cleanup script to purge unnecessary data from the 4 existing shards based on step #2, since the purged data now lives on other shards.

  • Disadvantages: 1) While the migration is happening, users might still be updating their data. Options include putting the system in read-only mode or writing new data to a separate server that is placed into the correct shards once the migration is done. 2) The copy and cleanup scripts might affect system performance during the migration. This can be mitigated by using system cloning and elastic load balancing, but both are expensive.
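
The sketch below shows the shape of such a lookup service and of the cut-over described in the steps above; the class, the use of MD5, and the entity ids are illustrative assumptions, not a specific implementation.

```python
import hashlib

class LookupService:
    """Toy directory service that owns the current partitioning scheme."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards            # step 1: keep modulo 4 for now

    def shard_for(self, entity_id: str) -> int:
        digest = int(hashlib.md5(entity_id.encode()).hexdigest(), 16)
        return digest % self.num_shards

    def switch_to(self, new_num_shards: int) -> None:
        # Step 4: flip the hash function only after the copy has completed,
        # so lookups never point at a shard the data has not reached yet.
        self.num_shards = new_num_shards

lookup = LookupService(num_shards=4)            # serving traffic today

# Steps 2-3 (run offline): compute placement under modulo 10 and copy rows there.
future = LookupService(num_shards=10)
copy_plan = {eid: future.shard_for(eid) for eid in ["entity-1", "entity-2"]}

# Step 4: cut over in the lookup service, then run the cleanup script (step 5).
lookup.switch_to(10)
```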

What are the common problems with Sharding?

Database joins become more expensive and are not feasible in certain cases

  1. When you shard the database, joins have to be performed across multiple networked servers, which can introduce additional latency for your service.

  2. The application layer also needs an additional level of asynchronous code and exception handling, which increases development and maintenance costs (a sketch of such a cross-shard query follows this list).

  3. Cross-machine joins may not be an option if you need to maintain a high-availability SLA for your service. In that case, the only option left is to de-normalize your database to avoid cross-server joins.
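
To illustrate what a cross-shard query looks like from the application layer, here is a rough Python sketch; the query_shard helper, the shard names, and the SQL are hypothetical and stand in for whatever driver your service actually uses.

```python
import asyncio

async def query_shard(shard: str, sql: str, params: tuple) -> list:
    """Hypothetical async driver call, stubbed out so the sketch runs."""
    return []   # real code would execute sql against the shard's connection

async def orders_for_customers(customer_ids: list, shards: list) -> list:
    # Fan the same query out to every shard, because matching rows may live
    # on any of them, then stitch the partial results together in memory.
    results = await asyncio.gather(
        *(query_shard(s, "SELECT * FROM orders WHERE customer_id IN %s",
                      (tuple(customer_ids),)) for s in shards),
        return_exceptions=True,   # the extra exception handling mentioned above
    )
    rows = []
    for partial in results:
        if isinstance(partial, Exception):
            continue              # decide how to degrade if one shard is down
        rows.extend(partial)
    return rows

print(asyncio.run(orders_for_customers([1, 2], ["orders-shard-1", "orders-shard-2"])))
```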

Sharding can compromise database referential integrity

  1. Most RDBMSs do not support foreign keys across databases on different database servers. This means that applications that require referential integrity often have to enforce it in application code and run regular SQL jobs to clean up dangling references (a sketch of such a cleanup job follows this list).

  2. If an application must modify data across shards, evaluate whether complete data consistency is actually required. Instead, a common approach in the cloud is to implement eventual consistency.

  3. Application logic must take responsibility for ensuring that the updates all complete successfully, as well as handling the inconsistencies that can arise from querying data.
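
As an example of the cleanup-job approach from point 1, here is a small Python sketch; run_query and the table/column names are assumptions standing in for your own schema and database driver.

```python
def run_query(shard: str, sql: str, params: tuple = ()) -> list:
    """Hypothetical helper that executes sql on one shard; stubbed out here."""
    return []

def purge_dangling_orders(orders_shard: str, customers_shard: str) -> None:
    # Customer ids referenced by orders on one shard...
    referenced = {row[0] for row in run_query(
        orders_shard, "SELECT DISTINCT customer_id FROM orders")}
    # ...versus the customer ids that actually exist on the other shard.
    existing = {row[0] for row in run_query(
        customers_shard, "SELECT customer_id FROM customers")}
    # The database cannot enforce this foreign key across servers, so the
    # job deletes (or could merely flag) the dangling references itself.
    for dangling_id in referenced - existing:
        run_query(orders_shard,
                  "DELETE FROM orders WHERE customer_id = %s", (dangling_id,))
```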

Database schema changes can become extremely expensive

In some situations, as your user base grows, the schema might need to evolve. For example, you might have been storing user pictures and user emails in the same shard and now need to put them on different shards. This means that all of your data will need to be relocated to a new location, which can cause downtime in your system.

A potential solution is to use directory based partitioning or consistent hashing.
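
For reference, here is a bare-bones sketch of the consistent hashing idea in Python; the server names, the use of MD5, and the absence of virtual nodes are simplifying assumptions.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Servers and keys are hashed onto the same ring; a key is stored on the
    first server clockwise from it, so adding a server only moves the keys in
    one slice of the ring instead of almost all of them."""

    def __init__(self, servers):
        self._ring = sorted((_hash(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        points = [h for h, _ in self._ring]
        idx = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db-0", "db-1", "db-2", "db-3"])
print(ring.server_for("user:42"))
```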
