Based on https://medium.com/must-know-computer-science/system-design-load-balancing-1c2e7675fc27

1. Concepts

A load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. It helps scale horizontally across an ever-increasing number of servers.

Traditionally, load balancers have been used to distribute requests across a horizontally scaled cluster, with the system replicated on multiple servers, where a single server does not have enough capacity to handle all the demand.

Load balancers also decouple clients from services, which is good practice from a cloud architecture perspective.

2. How does the load balancer work?

Define IP or DNS name for LB: Administrators define one IP address and/or DNS name for a given application, task, or website, to which all requests will come. This IP address or DNS name points to the load balancing server.
Add backend pool for LB: The administrator then enters into the load balancing server the IP addresses of all the actual servers that will share the workload for a given application or task. This pool of available servers is only accessible internally, via the load balancer.
Deploy LB: Finally, the load balancer needs to be deployed, either as a proxy, which sits between your app servers and your users worldwide and accepts all traffic, or as a gateway, which assigns a user to a server once and leaves the interaction alone thereafter.
Forward requests: Once the load balancing system is in place, all requests to the application come to the load balancer and are forwarded to a backend server according to the administrator’s preferred algorithm.
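The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the LoadBalancer class, its field names, and the addresses are hypothetical, and random choice stands in for whichever algorithm the administrator prefers.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadBalancer:
    public_address: str                  # the single IP/DNS name clients use
    backend_pool: list[str]              # servers reachable only through the LB
    choose: Callable[[list[str]], str]   # the administrator's chosen algorithm

    def handle(self, request: str) -> str:
        # Every request arrives here and is forwarded to one backend.
        backend = self.choose(self.backend_pool)
        return f"{request} -> {backend}"


# Hypothetical addresses, for illustration only.
lb = LoadBalancer(
    public_address="app.example.com",
    backend_pool=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    choose=random.choice,
)
print(lb.handle("GET /index.html"))
```

Passing the algorithm in as a function keeps the pool definition separate from the balancing policy, so the policy can be swapped without touching the rest.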

3. Types of LBs

Load balancers are generally grouped into two categories: Layer 4 and Layer 7.

Layer 4 load balancers

They act upon data found in network and transport layer protocols (IP, TCP, UDP). Most work as network address translators (NATs): they rewrite packet addresses to spread the load across the backend servers.
Session persistence can be achieved at the IP address level
No termination for TCP connections
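One way to get persistence at the IP address level is to hash the client IP and use the result to pick a backend, so the same client keeps landing on the same server. The sketch below assumes a fixed pool; the function name and addresses are hypothetical.

```python
import hashlib


def pick_backend(client_ip: str, backends: list[str]) -> str:
    """Map a client IP to a backend deterministically."""
    # A stable hash (unlike Python's built-in hash() for strings, which is
    # randomized per process) gives the same answer on every run.
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


pool = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
first = pick_backend("203.0.113.7", pool)
second = pick_backend("203.0.113.7", pool)
print(first == second)  # True: the same client IP maps to the same backend
```

Note that with plain modulo hashing, adding or removing a server remaps many clients to a different backend.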

The 2 modes of L4 LB

Direct Server Return (DSR): The TCP connection is established directly between the client and the backend. The load balancer sees only the requests and changes just the destination MAC address of the packets. The backends answer the client directly, using the service IP (VIP) configured on a loopback interface, which is why the source address of the response is the VIP. Requests pass through the load balancer but responses do not, so the LB is not the bandwidth bottleneck for responses.

NAT Mode: Clients connect to the service VIP. The load balancer chooses a server in the pool and forwards packets to it by changing the destination IP address (DNAT). The LB becomes the default gateway for the real servers, and the source IP remains the client’s IP. All traffic passes through the load balancer, so output bandwidth is limited by the load balancer’s output capacity. Only one connection is established.
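The difference between the two modes comes down to which packet fields the load balancer rewrites on the way in. The following is a toy model, not a packet-processing implementation: the Packet type, function names, and addresses are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    dst_mac: str


def forward_dsr(packet: Packet, backend_mac: str) -> Packet:
    # DSR: only the destination MAC changes; the destination IP stays the VIP,
    # which the backend also holds on a loopback interface.
    return replace(packet, dst_mac=backend_mac)


def forward_nat(packet: Packet, backend_ip: str, backend_mac: str) -> Packet:
    # NAT mode: the destination IP is rewritten (DNAT). The frame is also
    # addressed to the backend's MAC, as with any forwarded packet.
    return replace(packet, dst_ip=backend_ip, dst_mac=backend_mac)


VIP = "192.0.2.10"  # hypothetical service VIP
incoming = Packet(src_ip="203.0.113.7", dst_ip=VIP, dst_mac="lb-mac")

print(forward_dsr(incoming, "backend-mac"))
print(forward_nat(incoming, "10.0.0.2", "backend-mac"))
```

In both cases the source IP is left as the client's address, which matches the descriptions above.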

Examples: Azure Load Balancer (not Azure Application Gateway); Nginx as a TCP and UDP load balancer

Layer 7 load balancers

They distribute requests based on data found in application layer protocols such as HTTP. They can further route on application-specific data such as HTTP headers, cookies, or data within the application message itself, such as the value of a specific parameter.
For the client, the destination IP is the IP of the load balancer; for the backend server, the source IP is the IP of the load balancer.
A cookie can be used to achieve a sticky session.
The client’s IP is preserved in the X-Forwarded-For header.
To read the HTTP information, the load balancer terminates the connection, so there are 2 TCP connections: Client-LB and LB-Backend.
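A small sketch of the two header-level behaviours described above: cookie-based stickiness and recording the client IP in X-Forwarded-For. The cookie name lb_backend, the function names, and the backend names are hypothetical; real load balancers pick their own cookie names and formats.

```python
from http.cookies import SimpleCookie

STICKY_COOKIE = "lb_backend"  # hypothetical cookie name


def choose_backend(headers: dict[str, str], backends: list[str], fallback: str) -> str:
    """Reuse the backend named in the sticky cookie if it is still in the pool."""
    cookie = SimpleCookie(headers.get("Cookie", ""))
    morsel = cookie.get(STICKY_COOKIE)
    if morsel is not None and morsel.value in backends:
        return morsel.value
    return fallback


def forwarded_headers(headers: dict[str, str], client_ip: str) -> dict[str, str]:
    """Copy the headers and append the client IP to X-Forwarded-For."""
    out = dict(headers)
    prior = out.get("X-Forwarded-For")
    out["X-Forwarded-For"] = f"{prior}, {client_ip}" if prior else client_ip
    return out


pool = ["app-1", "app-2"]
request_headers = {"Cookie": "lb_backend=app-2; theme=dark"}
print(choose_backend(request_headers, pool, fallback="app-1"))
print(forwarded_headers(request_headers, "203.0.113.7")["X-Forwarded-For"])
```

Because the backend sees the load balancer's address as the source IP, the header is how the original client address reaches the application.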

Examples: Azure Application Gateway (not Azure Load Balancer); Nginx as an HTTP load balancer

5. Algorithms

Least connection
Least response time
Least bandwidth
Round robin
Weighted round-robin
IP hash
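As an illustration of one entry in this list, least connection can be sketched as picking the server with the fewest active connections. The bookkeeping below is an assumption: a real load balancer would update the counts as connections open and close.

```python
def least_connections(active: dict[str, int]) -> str:
    """Return the server with the fewest active connections."""
    # On a tie, min() returns the first such server in dict order.
    return min(active, key=lambda name: active[name])


# Hypothetical snapshot of active connections per server.
active = {"Server A": 3, "Server B": 1, "Server C": 2}

target = least_connections(active)
active[target] += 1  # the new connection now counts against that server
print(target)        # Server B
```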

Round Robin

What is it?

Using this method, client requests are routed to available servers on a cyclical basis. Round robin server load balancing works best when servers have roughly identical computing capabilities and storage capacity.

How does it work?

In a nutshell, round robin network load balancing rotates connection requests among web servers in the order that requests are received. For a simplified example, assume that an enterprise has a cluster of three servers: Server A, Server B, and Server C.

The first request is sent to Server A.
The second request is sent to Server B.
The third request is sent to Server C.

The load balancer continues passing requests to servers in this order, so the fourth request goes back to Server A. This spreads requests evenly across the servers, which balances the load as long as requests cost roughly the same to serve.
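The rotation over the three servers can be sketched with itertools.cycle. This is a minimal illustration that ignores health checks and servers joining or leaving the pool.

```python
from itertools import cycle

servers = ["Server A", "Server B", "Server C"]
rotation = cycle(servers)  # endless A, B, C, A, B, C, ...

# Assign five incoming requests in arrival order.
assignments = [next(rotation) for _ in range(5)]
print(assignments)
# ['Server A', 'Server B', 'Server C', 'Server A', 'Server B']
```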

What are the benefits?

The biggest advantage of round robin load balancing is that it is simple to understand and implement. However, that simplicity is also its biggest disadvantage: it takes no account of each server’s capacity or current load, which is why many load balancers use weighted round robin or more complex algorithms.

What is the difference between Round Robin and Weighted Round Robin?

The weighted round robin load balancing algorithm allows site administrators to assign weights to each server based on criteria like traffic-handling capacity. Servers with higher weights receive a higher proportion of client requests. For a simplified example, assume that an enterprise has a cluster of three servers:

Server A can handle 15 requests per second, on average
Server B can handle 10 requests per second, on average
Server C can handle 5 requests per second, on average

Next, assume that the load balancer receives 6 requests.

3 requests are sent to Server A
2 requests are sent to Server B
1 request is sent to Server C

In this manner, the weighted round robin algorithm distributes the load according to each server’s capacity.
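The 3:2:1 split in this example follows from reducing the capacities 15, 10, and 5 by their greatest common divisor. A minimal sketch, assuming the simplest scheme of repeating each server in the rotation according to its weight (the function name is hypothetical):

```python
from collections import Counter
from functools import reduce
from itertools import cycle
from math import gcd

capacities = {"Server A": 15, "Server B": 10, "Server C": 5}


def weighted_rotation(capacities: dict[str, int]):
    """Cycle through servers, repeating each one in proportion to its capacity."""
    divisor = reduce(gcd, capacities.values())  # 5 for 15, 10, 5
    schedule = [
        name
        for name, capacity in capacities.items()
        for _ in range(capacity // divisor)     # weights 3, 2, 1
    ]
    return cycle(schedule)


rotation = weighted_rotation(capacities)
first_six = [next(rotation) for _ in range(6)]
print(Counter(first_six))
# Counter({'Server A': 3, 'Server B': 2, 'Server C': 1})
```

This naive schedule sends consecutive requests to the same server (A, A, A, B, B, C); an implementation could interleave the servers instead while keeping the same proportions.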
