How to design a unique ID generator in distributed systems

Step 1 - Understand the problem and establish design scope

Characteristics of unique IDs ?	IDs = unique & sortable
Does ID increment by 1 for each new record ?	ID increments by time, but not necessarily by 1.
IDs contain numerical values ?	Contain numerical values.
ID length requirement ?	IDs should fit into 64-bit.
Scale of the system ?	Able to generate 10000 IDs per second.

Step 2 - High-level design

Traditional approach - Non-Distributed System

Approach uses a shared counter, the latter increases at each call.

We can also, generate un ID as a timestamp function.

Shared counter and timestamp function are limited because generate issues :

multiple independent servers can generate the same ID;
the same ID generates for 2 consecutives requests.

Approaches in Distributed system6

There are probably 7 options / approaches and here’s below an overall comparison table below.

	Uuid/Guid	MongoDB	Centralize MySQL (Flick - Ticket Server)	MySQL : Cluster	Twitter Snowflake	Snowflake	Baiduuid Generator
Ordered
Scale
Length (Bits)
Uniqueness
Simplicity
Geo-latency
Indexable

UUID

Each Web server could contain an ID Generator & the Web Server is responsible for generating IDs independently.

Pros & Cons for step 1

Pros : generating is simple (no coordination between servers and so, no synchronization issues); easy to scale in each web servers; uniqueness;
Cons : IDs are 128 bits long & we need 64 bits; IDs do not go up with time; IDs could be non-numeric.

CENTRALIZE MYSQL (FLICKR) - TICKET SERVER

Ticket Server will generate distributed primary keys. The idea is to use a centralized auto_increment features in a single DB server (Ticket Server).

Pros & Cons for step 1

Pros: Numeric IDs. Easy to implement and works for small to medium scale apps.
Cons: single point of failure, but we can scale, but will introduce more challenges.

MySQL CLUSTER - Multi-Master Replication

Uses the databases' auto_increment feature. Instead of increasing the next ID by 1, we increase it by K where K is the number of DB servers in use.

Next ID = the previous ID + 2.

Pros & Cons for step 1

Pros: simple to implement.
Cons: hard to scale with multiple data centers; IDs do not go up with time across multiple servers; it does not scale well when a server os added or removed.

TWITTER SNOWFLAKE

Twitter snowflake is a dedicated service for generating 64-bit unique identifiers used in distributed computing for objects within Twitter such as Tweets, Direct Messages, Lists, etc. These IDs are unique 64-bit unsigned integers, which are based on time.

The full IDs are made up of the following components :

Epoch timestamp in a millisecond — 41 bits (gives us 69 years for any custom epoch)
Configured machine/node/shard Id — 10 bits (gives us up to total of 2 i.e 1024 Ids)
Sequence number — 12 bits (A local counter per machine that sets to zero after every 4096 values)
In the beginning, the extra one reserved bit is set as 0 to make the overall number positive.

Pros & Cons for step 1

Pros: TIMESTAMP is sortable, fit into 64 bits, high availability is easily supported.
Cons: N/A

Clock synchronization : all ID generation servers could have the same clock to generate the timestamp, which might not be accurate in distributed systems. In reality, system clocks can drastically skew in distributed systems.

MONGODB

MongoDB uses ObjectIds as the default value of the _id field of each document, which is generated while creating any document. More Detail
ObjectID is a 96-bit number that is composed as follows:

a 4-byte timestamp value representing the seconds since the Unix epoch (which will not run out of seconds until the year 2106)
a 5-byte random value, and
a 3-byte incrementing counter, starting with a random value.

BAIDU UID GENERATOR

UID generator is developed by the Baidu technology department and implemented based on the snowflake algorithm. Unlike the original snowflake algorithm, It works as a component and allows users to override workID bits and initialization strategy. Suitable for virtualization environments such as DOCKER.

A unique ID (long) of 64 bits can be generated.

Sign (1bit) fixed 1bit symbol ID, that is, the generated id is a positive number.
delta seconds (28 bits)

Current time, relative to the incremental value of time base point “2016-05-20”, unit: second, can support up to 8.7 years

Worker ID (22 bits) machine ID, which can support up to 420W machine starts. The built-in implementation is allocated by the database at startup. The default allocation policy is to discard after use, and reuse policy can be provided later.
Sequence (13 bits) is a concurrent sequence per second. 13 bits can support 8192 concurrent sequences per second.

SONYFLAKE

Sonyflake is a distributed unique ID generator inspired by Twitter’s Snowflake. Sonyflake focuses on lifetime and performance in many host/core environments. So it has a different bit assignment from Snowflake.
A Sonyflake ID is composed of

39 bits for a time in units of 10 msec
8 bits for a sequence number
16 bits for a machine id

As a result, Sonyflake has the following advantages and disadvantages:

The lifetime (174 years) is longer than that of Snowflake (69 years)
It can work in more distributed machines (²¹⁶) than Snowflake (²¹⁰)
It can generate ²⁸ IDs per 10 msec at most in a single machine/thread (slower than Snowflake)

Step 3 - Design deep dive

Alex Xu chooses the Twitter Snowflake approach because of the requirements established above.

At the startup time, datacenter ID & machine ID are chosen. Time stamp and sequence numbers are generated when the ID generator is running.

Timestamp

41 bits make up the timestamp section : Timestamp grows with time and is sortable by time. See page References & Glossary.

Sequence Number

It's 12 bits which gives 2^ 12 = 4096 combinations. This field is 000000000000 unless more than 1 ID is generated in ms on the same server.

Step 4 - Pros & Cons

Pros & Cons : see the table above.

But, we need to discuss more about Clock synchronization, high availability and section length tuning (sequence numbers & timestamp) = Cons.