System Scaling
Learn strategies for scaling backend systems to handle growth in traffic, data, and complexity.
Why Systems Need to Scale
As applications grow, increasing users, requests, and data eventually overwhelm a single server, requiring systems to scale.
- More users generate more concurrent requests that increase system load.
- Growing data volumes slow down processing and storage operations.
- A single server becomes a performance bottleneck under sustained demand.
Details
All systems start simple, often running on a single machine. At low scale, this works well because the server can handle incoming requests, process logic, and manage data without significant strain.
As the application grows, demand increases. More users mean more simultaneous requests, which puts pressure on CPU, memory, and network resources.
Data growth adds another layer of stress. Larger datasets require more storage and can make queries slower, especially queries that scan or sort large portions of the data, which further increases response times.
Eventually, the system reaches a point where one machine can no longer keep up. Scaling becomes necessary to maintain performance, prevent slowdowns, and ensure the system continues to operate reliably under increasing load.
Vertical Scaling
Vertical scaling increases the capacity of a single machine by adding more resources like CPU, memory, and storage.
- Upgrading hardware allows a single server to handle more load.
- No need to manage multiple machines or distributed coordination.
- Scaling is limited by hardware constraints and rising costs.
Details
Vertical scaling, also known as scaling up, improves system capacity by upgrading a single server. This typically involves adding more CPU cores, increasing RAM, or expanding storage.
This approach is straightforward. The system architecture remains unchanged, and there is no need to introduce distributed system complexity such as network communication or data synchronization.
However, vertical scaling has hard limits. Physical machines can only be upgraded to a certain point, and high-end hardware becomes increasingly expensive.
Because of these limits, vertical scaling works well in early stages but eventually becomes insufficient for large-scale systems.
Horizontal Scaling
Horizontal scaling increases system capacity by adding more servers and distributing workload across them.
- Workload is distributed across multiple machines instead of one.
- System can handle significantly higher traffic by scaling out.
- Failure of one server does not bring down the entire system.
Details
Horizontal scaling, also known as scaling out, improves system capacity by adding more servers instead of upgrading a single one. Each server handles a portion of the total workload.
Incoming traffic is distributed across these servers, allowing the system to process many more requests in parallel compared to a single-machine setup.
This approach also improves reliability. If one server fails, others can continue handling requests, reducing the impact of failures.
However, horizontal scaling introduces distributed system complexity, requiring coordination, load balancing, and data consistency mechanisms.
Load Balancing
A load balancer distributes incoming requests across multiple servers to improve performance and reliability.
- Requests are distributed across servers to prevent overload on a single machine.
- Unhealthy servers are detected and removed from traffic routing.
- Reliability improves because no single application server is a single point of failure.
Details
As systems scale horizontally, incoming traffic must be distributed across multiple servers. A load balancer sits in front of these servers and acts as the entry point for all requests.
Instead of sending all traffic to one machine, the load balancer uses algorithms such as round-robin or least connections to distribute requests evenly. This prevents any single server from becoming a bottleneck.
Load balancers also continuously check the health of servers. If a server becomes slow, crashes, or fails health checks, it is automatically removed from the pool so users are not routed to a broken instance.
In addition, load balancers can handle tasks like SSL termination, request routing based on paths, and traffic shaping. Tools like Nginx, HAProxy, and cloud-managed load balancers are widely used to implement this layer in modern systems.
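The round-robin and health-check behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not how Nginx or HAProxy are implemented: the class name `RoundRobinBalancer`, its methods, and the server names `app-1` to `app-3` are all hypothetical, and health status is set by hand rather than by real health probes.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_unhealthy(self, server):
        # In a real system this would be driven by failed health checks.
        self.healthy.discard(server)

    def mark_healthy(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once, starting from the rotation pointer.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_round = [lb.pick() for _ in range(3)]

# Simulate a failed health check: app-2 leaves the pool.
lb.mark_unhealthy("app-2")
after_failure = [lb.pick() for _ in range(4)]

print(first_round)
print(after_failure)
```

Once `app-2` is marked unhealthy, the rotation continues over the remaining servers, so callers never receive a server that has been removed from the pool.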
Load Balancing Strategies
Different load balancing strategies determine how traffic is distributed, directly impacting performance, fairness, and system stability.
- Round-robin distributes requests evenly but ignores server load.
- Least connections sends traffic to the least busy server.
- Sticky sessions keep users tied to the same server when needed.
Details
A load balancer is not just a traffic splitter; how it distributes traffic matters.
The simplest strategy is round-robin, where each request goes to the next server in sequence. This works well when servers are identical and requests cost roughly the same, but it produces uneven load when request cost or server capacity varies.
Least connections improves this by routing traffic to the server handling the fewest active requests, reducing overload and improving response times.
Sticky sessions keep a user tied to the same server. This helps when session data is stored locally, but reduces flexibility and can create uneven load.
Modern systems combine strategies with health checks and weights. Stronger servers can take more traffic, while failing ones are removed automatically.
Poor strategy selection leads to uneven load, higher latency, and wasted infrastructure, even if a load balancer is present.
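To make the combination of least connections and weights concrete, here is a small Python sketch. It assumes a hypothetical `LeastConnectionsBalancer` class that tracks active connections per server and picks the server with the lowest connections-to-weight ratio; the server names and weight values are made up for illustration.

```python
class LeastConnectionsBalancer:
    """Toy weighted least-connections picker."""

    def __init__(self, weights):
        # weights: server name -> relative capacity (illustrative values)
        self.weights = dict(weights)
        self.active = {name: 0 for name in self.weights}

    def acquire(self):
        # Choose the server with the fewest active connections per unit of
        # weight. Ties go to the server that was listed first.
        server = min(self.active, key=lambda s: self.active[s] / self.weights[s])
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when a request finishes.
        self.active[server] -= 1


lb = LeastConnectionsBalancer({"big": 2, "small": 1})
picks = [lb.acquire() for _ in range(6)]

print(picks)
print(lb.active)
```

With a weight of 2, `big` ends up holding roughly twice as many open connections as `small`, which is the "stronger servers can take more traffic" behavior described above.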
Stateless Servers
Stateless servers do not store client-specific data locally, allowing any server to handle any request.
- Servers do not retain session data between requests.
- State is stored externally in databases, caches, or session stores.
- Requests can be routed to any server without dependency on prior interactions.
Details
In a stateless system, each request is independent. The server does not store information about previous requests or user sessions in its local memory.
Instead, any required state—such as user sessions or authentication data—is stored in external systems like databases, caches, or dedicated session stores.
This design makes horizontal scaling much easier. Since no server holds unique session data, requests can be routed to any available server without concern for where previous requests were handled.
Stateless architecture also improves reliability and flexibility, as servers can be added, removed, or replaced without disrupting user interactions.
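The following Python sketch illustrates the idea under simplifying assumptions. `SessionStore` is a hypothetical in-memory stand-in for an external cache or database, and `AppServer` and `handle_add_to_cart` are illustrative names; the point is only that the servers themselves keep no per-user state.

```python
class SessionStore:
    """Stand-in for an external session store (cache or database)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = dict(session)


class AppServer:
    """Stateless server: all session data lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return {"served_by": self.name, "cart": cart}


store = SessionStore()
server_a = AppServer("app-1", store)
server_b = AppServer("app-2", store)

# Two requests from the same session land on different servers.
r1 = server_a.handle_add_to_cart("sess-42", "book")
r2 = server_b.handle_add_to_cart("sess-42", "pen")

print(r1)
print(r2)
```

Because the cart is read from and written to the shared store on every request, the second server sees the item added through the first one, and either server could be removed without losing the session.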
Autoscaling
Autoscaling dynamically adjusts the number of servers based on traffic to maintain performance and efficiency.
- System automatically adds servers when traffic increases.
- Servers are removed when demand decreases to reduce resource usage.
- Scaling decisions are based on metrics like CPU usage or request rate.
Details
Traffic in real systems is not constant. Usage can spike during peak hours and drop during off-peak times, making fixed infrastructure inefficient.
Autoscaling solves this by automatically adjusting system capacity. When traffic increases, new servers are added to handle the load. When traffic decreases, unused servers are removed.
This process is typically driven by metrics such as CPU utilization, memory usage, or request rate. Thresholds on these metrics determine when servers are added or removed, so the system can react quickly to changes in demand.
Autoscaling improves both performance and cost efficiency. Systems can handle sudden spikes without manual intervention while avoiding unnecessary infrastructure costs during low usage periods.
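One common way to turn a metric into a scaling decision is a proportional rule: size the fleet so that average utilization lands near a target value. The sketch below assumes this rule; the function name `desired_servers`, the 60% CPU target, and the minimum and maximum fleet sizes are illustrative choices, not values from any particular autoscaler.

```python
import math


def desired_servers(current, cpu_percent, target_percent=60.0,
                    min_servers=2, max_servers=20):
    """Return how many servers to run given the current average CPU."""
    # Proportional rule: if CPU is 1.5x the target, run 1.5x the servers.
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Clamp so the fleet never shrinks below a safe floor or grows unbounded.
    return max(min_servers, min(max_servers, wanted))


print(desired_servers(4, 90))    # traffic spike: scale out
print(desired_servers(4, 30))    # quiet period: scale in
print(desired_servers(4, 10))    # very quiet: held at the minimum
print(desired_servers(10, 150))  # extreme spike: held at the maximum
```

The floor and ceiling matter in practice: the floor keeps some redundancy during quiet periods, and the ceiling caps cost if a metric misbehaves.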
Content Delivery Networks (CDNs)
CDNs cache content closer to users to reduce latency and offload traffic from application servers.
- Static content is cached on edge servers located near users.
- Requests are served from the nearest location instead of the origin server.
- Reduces latency and decreases load on backend infrastructure.
Details
When users access an application, data typically travels from a central server, which may be far away geographically. This distance introduces latency and slows down response times.
CDNs solve this by caching static content—such as images, scripts, and stylesheets—on edge servers distributed around the world. When a user makes a request, it is served from the closest edge location.
This significantly reduces the distance data must travel, resulting in faster load times and improved user experience.
CDNs also reduce load on origin servers by handling a large portion of traffic. Services like Cloudflare, AWS CloudFront, and Fastly are commonly used to implement this layer in scalable systems.
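The caching behavior of an edge server can be approximated with a small time-to-live (TTL) cache. This is a simplified sketch, not how any of the services named above work internally: `EdgeCache`, the `origin` function, and the 300-second TTL are hypothetical, and the current time is passed in explicitly to keep the example deterministic.

```python
class EdgeCache:
    """Toy edge cache: serves from memory until an entry's TTL expires."""

    def __init__(self, fetch_from_origin, ttl_seconds=300):
        self.fetch_from_origin = fetch_from_origin
        self.ttl = ttl_seconds
        self._entries = {}  # path -> (expires_at, body)

    def get(self, path, now):
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1], "HIT"
        # Missing or expired: go back to the origin and refresh the entry.
        body = self.fetch_from_origin(path)
        self._entries[path] = (now + self.ttl, body)
        return body, "MISS"


origin_calls = []


def origin(path):
    # Records every request that actually reaches the origin server.
    origin_calls.append(path)
    return f"contents of {path}"


edge = EdgeCache(origin, ttl_seconds=300)
results = [
    edge.get("/logo.png", now=0)[1],    # first request, not cached yet
    edge.get("/logo.png", now=100)[1],  # within the TTL
    edge.get("/logo.png", now=400)[1],  # after the TTL has expired
]

print(results)
print(len(origin_calls))
```

Only the first request and the request after expiry reach the origin; everything in between is answered at the edge, which is where the latency and offload benefits come from.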
Scalable System Architecture
Scalable systems are built as layered architectures where each component handles a specific part of the workload.
- Traffic flows through multiple layers, each designed to handle scale efficiently.
- Load is distributed across application servers and databases.
- Each layer can scale independently based on system demand.
Details
A scalable system is not a single component but a combination of layers working together to handle increasing demand. Each layer is responsible for a specific function in the request flow.
Users first interact with a CDN, which serves cached content and reduces latency. Requests that require dynamic processing are forwarded to a load balancer.
The load balancer distributes traffic across multiple application servers, allowing the system to process many requests in parallel.
Application servers interact with a distributed database, where data is partitioned or replicated to handle large datasets and high query volumes.
This layered approach allows each part of the system to scale independently, making it possible to handle growth in users, traffic, and data without redesigning the entire system.