How to design a Chat System
Introduction
A chat app performs different functions for different people, so it is important to pin down the exact feature requirements before designing anything.
Step 1 - Understand the problem and establish design scope
Question | Answer |
---|---|
Kind of chat ? 1 to 1 or group chat ? | both |
Mobile or Web app ? or both ? | both |
Scale of the app ? start-up or massive scale ? | massive scale : 50 million daily active users (DAU). |
Limit for a group chat ? | Max 100 people = small group. |
Features ? Support attachment ? | 1 on 1 chat, group chat, online indicator. Text messages ONLY, no attachments. |
Limit of text message ? | Less than 100 000 characters long. |
End-to-end encryption required ? | Not required for now, but may be added later. |
Chat history ? | Forever |
Push notifications ? | Yes. |
Online presence ? | Yes. |
Multiple device support ? | Yes, the same account can be logged in on multiple devices at the same time. |
Step 2 - High-level design
For a chat service, the choice of network protocol is important. HTTP works well on the sender side, where the client initiates the request, but the problem occurs on the receiver side: HTTP is client-initiated, so the server cannot easily push a new message to the client. There are 3 techniques to simulate a server-initiated connection: polling, long polling and WebSocket.
See polling & long polling in the page References & Glossary for chat system.
WebSocket is the most common solution for sending asynchronous updates from server to client. Because a WebSocket (WS) connection is persistent and bidirectional, it is used on both the sender and receiver sides.
The high-level design has 3 major categories: stateless services, stateful services and third-party integration. The design is distributed from the start, because a single-server design is a deal breaker (single point of failure).
The client maintains a persistent WebSocket connection to a chat server for real-time messaging. The main components are:
Chat servers facilitate message sending/receiving.
Presence servers manage online/offline status.
API servers handle user login, signup, profile changes, etc.
Notification servers send push notifications.
A KV store holds the chat history: when an offline user comes back online, they see all their previous chat history. See the page References & Glossary for chat system - Storage.
Step 3 - Design deep dive
Service Discovery
User A tries to log in to the app.
The load balancer (LB) sends the login request to the API servers.
After the backend authenticates the user, service discovery finds the best chat server for User A.
User A connects to that chat server through WebSocket.
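The "best chat server" step can be sketched as a small selection function. This is only an illustration: the `ChatServer` fields, the `pick_chat_server` name and the selection policy (skip full servers, prefer the user's region, then take the lowest load ratio) are assumptions, not something the design above prescribes.

```python
from dataclasses import dataclass


@dataclass
class ChatServer:
    name: str
    region: str
    connections: int  # current number of open WebSocket connections
    capacity: int     # maximum number of connections the server accepts


def pick_chat_server(servers: list[ChatServer], user_region: str) -> ChatServer:
    """Return the chat server the client should open its WebSocket to.

    Assumed policy: ignore full servers, prefer the user's region,
    then choose the server with the lowest load ratio.
    """
    available = [s for s in servers if s.connections < s.capacity]
    if not available:
        raise RuntimeError("no chat server has spare capacity")
    return min(
        available,
        key=lambda s: (s.region != user_region, s.connections / s.capacity),
    )


# Hypothetical fleet of chat servers registered in service discovery.
servers = [
    ChatServer("chat-1", "eu", connections=900, capacity=1000),
    ChatServer("chat-2", "eu", connections=300, capacity=1000),
    ChatServer("chat-3", "us", connections=10, capacity=1000),
]
best = pick_chat_server(servers, user_region="eu")
```

With this made-up fleet, a user in the `eu` region is sent to `chat-2`, the least loaded server in that region.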
Message Flows (Message synchronization across devices & group chat flow)
1 on 1 chat flow
User A sends a chat message to chat server 1.
Chat server 1 obtains a message ID from the ID generator.
Chat server 1 sends the message to the message sync queue.
The message is stored in the KV store.
…
If User B is online, the message is forwarded to chat server 2, where User B is connected.
If User B is offline, a push notification is sent from the push notification servers.
Chat server 2 forwards the message to User B.
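The flow above can be sketched with in-memory stand-ins. Everything here is hypothetical: `itertools.count` replaces the ID generator, a `deque` replaces the message sync queue, plain dictionaries replace the KV store and the table of connected users, and the function names are made up. The steps elided in the list above are not modelled.

```python
import itertools
from collections import deque

id_generator = itertools.count(1)            # stand-in for the ID generator
sync_queue: deque = deque()                  # stand-in for the message sync queue
kv_store: dict[int, dict] = {}               # stand-in for the KV store
online_users = {"user_b": "chat-server-2"}   # online user -> chat server holding the WebSocket


def receive_from_client(sender: str, recipient: str, text: str) -> int:
    """Chat server 1: get a message ID and enqueue the message."""
    message_id = next(id_generator)
    sync_queue.append(
        {"id": message_id, "from": sender, "to": recipient, "text": text}
    )
    return message_id


def process_queue() -> list[tuple[str, str, int]]:
    """Store each queued message, then route it to a chat server or to push."""
    actions = []
    while sync_queue:
        message = sync_queue.popleft()
        kv_store[message["id"]] = message
        server = online_users.get(message["to"])
        if server is not None:
            actions.append(("forward", server, message["id"]))
        else:
            actions.append(("push_notification", message["to"], message["id"]))
    return actions


receive_from_client("user_a", "user_b", "hi")     # user_b is online
receive_from_client("user_a", "user_c", "hello")  # user_c is offline
actions = process_queue()
```

In this run the first message is forwarded to `chat-server-2`, while the second one triggers a push notification for `user_c`. Both end up in `kv_store`.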
Message Synchronization
Each device maintains a variable called “cur_max_message_id”, which keeps track of the latest message ID on the device.
A message is considered new when 2 conditions hold:
the recipient ID is equal to the currently logged-in user ID.
the message ID in the KV store is larger than “cur_max_message_id”.
Because each device keeps its own cur_max_message_id, synchronization is easy: each device fetches its new messages from the KV store independently.
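A minimal sketch of this check, assuming the KV store can be scanned as a list of messages. The `Message` and `Device` classes and the `fetch_new_messages` helper are hypothetical names introduced for illustration. Only `cur_max_message_id` comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Message:
    message_id: int
    recipient_id: str
    text: str


def fetch_new_messages(kv_store, user_id, cur_max_message_id):
    """Return the messages that satisfy both conditions, oldest first."""
    new = [
        m for m in kv_store
        if m.recipient_id == user_id and m.message_id > cur_max_message_id
    ]
    return sorted(new, key=lambda m: m.message_id)


class Device:
    """One logged-in device with its own cur_max_message_id."""

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.cur_max_message_id = 0

    def sync(self, kv_store):
        new = fetch_new_messages(kv_store, self.user_id, self.cur_max_message_id)
        if new:
            self.cur_max_message_id = new[-1].message_id
        return new


store = [
    Message(1, "user_b", "hi"),
    Message(2, "user_a", "yo"),
    Message(3, "user_b", "there"),
]
phone, laptop = Device("user_b"), Device("user_b")
first_phone_sync = phone.sync(store)    # messages 1 and 3
store.append(Message(4, "user_b", "again"))
second_phone_sync = phone.sync(store)   # only message 4
laptop_sync = laptop.sync(store)        # messages 1, 3 and 4
```

The phone and the laptop belong to the same user but advance independently, so a device that was offline for longer simply fetches a longer list.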
Small group chat flow
Each group message is copied into every recipient’s inbox. This simplifies the message sync flow, as each client only needs to check its own inbox to get new messages.
When the number of group members is small, storing a copy in each recipient’s inbox is not too expensive.
On the recipient side, a recipient can receive messages from multiple users (see diagram below).
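The copy-per-recipient idea can be sketched as follows. The in-memory `inboxes` mapping and the `send_to_group` and `read_inbox` helpers are hypothetical. The limit of 100 members is the one stated in the requirements table.

```python
from collections import defaultdict, deque

# Hypothetical in-memory inboxes: one queue of pending messages per user.
inboxes: dict[str, deque] = defaultdict(deque)

MAX_GROUP_SIZE = 100  # small-group limit from the requirements


def send_to_group(sender: str, members: list[str], message_id: int, text: str) -> int:
    """Copy the message into the inbox of every member except the sender."""
    if len(members) > MAX_GROUP_SIZE:
        raise ValueError("group exceeds the small-group limit")
    recipients = [m for m in members if m != sender]
    for recipient in recipients:
        inboxes[recipient].append({"id": message_id, "from": sender, "text": text})
    return len(recipients)


def read_inbox(user: str) -> list[dict]:
    """A client only checks its own inbox, then empties it."""
    messages = list(inboxes[user])
    inboxes[user].clear()
    return messages


group = ["user_a", "user_b", "user_c"]
copies = send_to_group("user_a", group, message_id=1, text="hello")
send_to_group("user_c", group, message_id=2, text="hi all")
inbox_b = read_inbox("user_b")  # holds messages from user_a and user_c
```

The cost is one write per recipient, which is why this approach is reserved for small groups.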
Online Presence
This indicator is an essential feature of many chat applications.
User login
The user login flow is explained in the “Service Discovery” section: once it completes, a WebSocket connection is established between the client and the real-time services.
2 variables are stored in the KV store: the online status and the last_active_at timestamp.
User logout
User disconnection
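When a user disconnects from the internet, the persistent connection between the client and the server is lost. We cannot update the status on every disconnect/reconnect: the indicator would change too often, which creates a poor user experience.
The solution is a “heartbeat event”: an online client sends an event to the presence servers every x seconds. If the servers receive no heartbeat within a given timeout, the user is considered offline.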
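A minimal sketch of the heartbeat check, assuming the presence server keeps the last_active_at timestamp per user. The `PresenceTracker` class and the 30-second timeout are hypothetical. The current time is passed in explicitly so the example stays deterministic.

```python
class PresenceTracker:
    """Marks a user offline when no heartbeat arrived within the timeout."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_active_at: dict[str, float] = {}  # user ID -> last heartbeat time

    def heartbeat(self, user_id: str, now: float) -> None:
        """Record a heartbeat event received from a client."""
        self.last_active_at[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self.last_active_at.get(user_id)
        return last is not None and now - last <= self.timeout_seconds


tracker = PresenceTracker(timeout_seconds=30)   # assumed timeout value
tracker.heartbeat("user_a", now=0)
online_at_20 = tracker.is_online("user_a", now=20)   # still within the timeout
online_at_31 = tracker.is_online("user_a", now=31)   # timeout exceeded
tracker.heartbeat("user_a", now=35)                  # client reconnected
online_at_40 = tracker.is_online("user_a", now=40)
```

A short network drop that is shorter than the timeout never flips the status, which is the point of the heartbeat.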
How do a user’s friends learn about status changes ?
Presence servers use a publish-subscribe model in which each friend pair maintains a channel.
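The channel-per-friend-pair model can be sketched with an in-memory bus. The `PresenceBus` class, the channel naming scheme and the callback signature are assumptions made for illustration. A real deployment would rely on a message broker rather than local callbacks.

```python
from collections import defaultdict


class PresenceBus:
    """In-memory publish-subscribe with one channel per (user, friend) pair."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # channel name -> callbacks

    @staticmethod
    def channel(user: str, friend: str) -> str:
        # Hypothetical naming scheme for the channel of a friend pair.
        return f"{user}-{friend}"

    def subscribe(self, friend: str, user: str, callback) -> None:
        """`friend` subscribes to the status changes of `user`."""
        self._subscribers[self.channel(user, friend)].append(callback)

    def publish_status(self, user: str, friends: list[str], status: str) -> int:
        """Publish the new status on each friend-pair channel of `user`."""
        delivered = 0
        for friend in friends:
            for callback in self._subscribers.get(self.channel(user, friend), ()):
                callback(user, status)
                delivered += 1
        return delivered


bus = PresenceBus()
seen_by_b = []
bus.subscribe("user_b", "user_a", lambda user, status: seen_by_b.append((user, status)))
# user_c is a friend of user_a but has not subscribed to the channel.
delivered = bus.publish_status("user_a", ["user_b", "user_c"], "online")
```

Here only `user_b` receives the update, since `user_c` has no subscriber on its channel.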
Step 4 - Pros & Cons
Pros: Decoupled architecture, real-time communication.
Cons: no support for media files (photos, …); end-to-end encryption not added; no client-side message caching, which would reduce data transfer between client and server and improve load time; limited error handling (chat server failure handled through the ZooKeeper service, message resend mechanism with retries).