System Scaling

Why Systems Need to Scale

As an application grows, rising numbers of users and requests, together with a growing volume of data, eventually overwhelm a single server, so the system has to scale.

[Diagram: users send requests to a single server; as requests increase, load builds.]

More users generate more concurrent requests that increase system load.
Growing data volumes slow down processing and storage operations.
A single server becomes a performance bottleneck under sustained demand.

Details

All systems start simple, often running on a single machine. At low scale, this works well because the server can handle incoming requests, process logic, and manage data without significant strain.

As the application grows, demand increases. More users mean more simultaneous requests, which puts pressure on CPU, memory, and network resources.

Data growth adds another layer of stress. Larger datasets require more storage and make queries slower, further increasing response times.

Eventually, the system reaches a point where one machine can no longer keep up. Scaling becomes necessary to maintain performance, prevent slowdowns, and ensure the system continues to operate reliably under increasing load.

Vertical Scaling

Vertical scaling increases the capacity of a single machine by adding more resources like CPU, memory, and storage.

[Diagram: load rises on a single machine.]

Upgrading hardware allows a single server to handle more load.
No need to manage multiple machines or distributed coordination.
Scaling is limited by hardware constraints and rising costs.

Details

Vertical scaling, also known as scaling up, improves system capacity by upgrading a single server. This typically involves adding more CPU cores, increasing RAM, or expanding storage.

This approach is straightforward. The system architecture remains unchanged, and there is no need to introduce distributed system complexity such as network communication or data synchronization.

However, vertical scaling has hard limits. Physical machines can only be upgraded to a certain point, and high-end hardware becomes increasingly expensive.

Because of these limits, vertical scaling works well in early stages but eventually becomes insufficient for large-scale systems.

Horizontal Scaling

Horizontal scaling increases system capacity by adding more servers and distributing workload across them.

[Diagram: load builds across the system.]

Workload is distributed across multiple machines instead of one.
System can handle significantly higher traffic by scaling out.
Failure of one server does not bring down the entire system.

Details

Horizontal scaling, also known as scaling out, improves system capacity by adding more servers instead of upgrading a single one. Each server handles a portion of the total workload.

Incoming traffic is distributed across these servers, allowing the system to process many more requests in parallel compared to a single-machine setup.

This approach also improves reliability. If one server fails, others can continue handling requests, reducing the impact of failures.

However, horizontal scaling introduces distributed system complexity, requiring coordination, load balancing, and data consistency mechanisms.

Load Balancing

A load balancer distributes incoming requests across multiple servers to improve performance and reliability.

[Diagram: incoming requests reach a load balancer, which distributes traffic; as more servers are added, load spreads out.]

Requests are distributed across servers to prevent overload on a single machine.
Unhealthy servers are detected and removed from traffic routing.
System reliability improves by avoiding single points of failure.

Details

As systems scale horizontally, incoming traffic must be distributed across multiple servers. A load balancer sits in front of these servers and acts as the entry point for all requests.

Instead of sending all traffic to one machine, the load balancer uses algorithms such as round-robin or least connections to distribute requests evenly. This prevents any single server from becoming a bottleneck.

Load balancers also continuously check the health of servers. If a server becomes slow, crashes, or fails health checks, it is automatically removed from the pool so users are not routed to a broken instance.
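The two behaviours described above, cycling through servers and skipping unhealthy ones, can be sketched in a few lines. This is a minimal illustration, not how Nginx or HAProxy work internally; the class name, the `mark_unhealthy` method, and the server names are hypothetical, and a real balancer would drive the health flags from periodic checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Cycles through servers in order, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_unhealthy(self, server):
        # Assumption: something else (a health checker) calls this on failure.
        self.healthy.discard(server)

    def mark_healthy(self, server):
        self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_pass = [balancer.next_server() for _ in range(3)]
# first_pass == ["app-1", "app-2", "app-3"]

balancer.mark_unhealthy("app-2")
after_failure = [balancer.next_server() for _ in range(4)]
# after_failure == ["app-1", "app-3", "app-1", "app-3"]
```

Once `app-2` is marked unhealthy it simply stops receiving traffic; the remaining servers absorb its share.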

In addition, load balancers can handle tasks like SSL termination, request routing based on paths, and traffic shaping. Tools like Nginx, HAProxy, and cloud-managed load balancers are widely used to implement this layer in modern systems.

Load Balancing Strategies

Different load balancing strategies determine how traffic is distributed, directly impacting performance, fairness, and system stability.

[Diagram: round robin, which spreads requests evenly but ignores server load.]

Round-robin distributes requests evenly but ignores server load.
Least connections sends traffic to the least busy server.
Sticky sessions keep users tied to the same server when needed.

Details

A load balancer is not just a traffic splitter — how it distributes traffic matters.

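The simplest strategy is round-robin, where each request goes to the next server in sequence. This works when servers are identical and requests cost roughly the same, but it spreads load unevenly when some requests are much more expensive than others or when servers differ in capacity.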

Least connections improves this by routing traffic to the server handling the fewest active requests, reducing overload and improving response times.
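As a rough sketch of least connections, the balancer only needs a count of in-flight requests per server. The class and method names here are hypothetical, and the example assumes the caller reports when a request finishes.

```python
class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active requests."""

    def __init__(self, servers):
        # Active request count per server; dicts preserve insertion order.
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # min() returns the first server with the lowest count, so ties
        # go to the server listed earliest.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request completes.
        self.active[server] -= 1


lb = LeastConnectionsBalancer(["app-1", "app-2"])
a = lb.acquire()   # "app-1" (tie, first listed)
b = lb.acquire()   # "app-2"
lb.release(b)      # app-2 finishes its request
c = lb.acquire()   # "app-2", since it now has 0 active vs 1 on app-1
```

Unlike round-robin, a server stuck on a slow request is naturally passed over until it frees up.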

Sticky sessions keep a user tied to the same server. This helps when session data is stored locally, but reduces flexibility and can create uneven load.
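One way to picture sticky sessions is a lookup table from user to server that is filled in on the first request. The sketch below is illustrative only: the names are hypothetical, and real load balancers usually achieve stickiness with a cookie or a hash of the client address rather than an in-memory table.

```python
class StickyBalancer:
    """Assigns each new user a server round-robin, then keeps them there."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.assignments = {}  # user_id -> server
        self._next = 0

    def route(self, user_id):
        if user_id not in self.assignments:
            server = self.servers[self._next % len(self.servers)]
            self.assignments[user_id] = server
            self._next += 1
        return self.assignments[user_id]


sticky = StickyBalancer(["app-1", "app-2"])
alice_first = sticky.route("alice")   # "app-1"
bob_first = sticky.route("bob")       # "app-2"
alice_again = sticky.route("alice")   # "app-1" again
```

The uneven-load risk is visible here: if Alice sends a thousand requests and Bob sends one, `app-1` does nearly all the work.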

Modern systems combine strategies with health checks and weights. Stronger servers can take more traffic, while failing ones are removed automatically.
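Weights and health checks can be combined in a small selection function. This is a sketch under stated assumptions: the weights, server names, and the `pick_weighted` helper are made up for illustration, and weighted random choice is only one of several ways to honour weights.

```python
import random


def pick_weighted(weights, healthy, rng=random):
    """Pick a healthy server with probability proportional to its weight."""
    candidates = [server for server in weights if server in healthy]
    if not candidates:
        raise RuntimeError("no healthy servers available")
    candidate_weights = [weights[server] for server in candidates]
    return rng.choices(candidates, weights=candidate_weights, k=1)[0]


# Hypothetical pool: "big" should take about three times the traffic of "small".
weights = {"big": 3, "small": 1, "broken": 2}
healthy = {"big", "small"}  # "broken" has failed its health checks

rng = random.Random(0)  # seeded so the run is repeatable
picks = [pick_weighted(weights, healthy, rng) for _ in range(1000)]
```

In this run `broken` never appears in `picks`, and `big` is chosen far more often than `small`.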

Poor strategy selection leads to uneven load, higher latency, and wasted infrastructure, even if a load balancer is present.

Stateless Servers

Stateless servers do not store client-specific data locally, allowing any server to handle any request.

[Diagram: stateless routing. Requests from users A, B, and C can go to any server; session data is stored externally in a shared state store.]

Servers do not retain session data between requests.
State is stored externally in databases, caches, or session stores.
Requests can be routed to any server without dependency on prior interactions.

Details

In a stateless system, each request is independent. The server does not store information about previous requests or user sessions in its local memory.

Instead, any required state—such as user sessions or authentication data—is stored in external systems like databases, caches, or dedicated session stores.

This design makes horizontal scaling much easier. Since no server holds unique session data, requests can be routed to any available server without concern for where previous requests were handled.
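The idea can be shown with two servers that share one session store. In this sketch `SessionStore` is an in-memory stand-in for an external cache or database, and the class names, the cart example, and the session ID are all hypothetical.

```python
class SessionStore:
    """Stand-in for an external store such as a cache or database."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = dict(session)


class AppServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared; the server keeps no session data itself

    def handle_add_to_cart(self, session_id, item):
        # Load state, apply the change, and write it back on every request.
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
server_a = AppServer("app-1", store)
server_b = AppServer("app-2", store)

server_a.handle_add_to_cart("session-42", "book")
cart = server_b.handle_add_to_cart("session-42", "pen")
# cart == ["book", "pen"] even though a different server handled each request
```

Because neither server holds the cart in its own memory, either one could be removed between the two requests without the user noticing.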

Stateless architecture also improves reliability and flexibility, as servers can be added, removed, or replaced without disrupting user interactions.

Autoscaling

Autoscaling dynamically adjusts the number of servers based on traffic to maintain performance and efficiency.

[Diagram: servers adjust based on traffic load, scaling down as load falls.]

System automatically adds servers when traffic increases.
Servers are removed when demand decreases to reduce resource usage.
Scaling decisions are based on metrics like CPU usage or request rate.

Details

Traffic in real systems is not constant. Usage can spike during peak hours and drop during off-peak times, making fixed infrastructure inefficient.

Autoscaling solves this by automatically adjusting system capacity. When traffic increases, new servers are added to handle the load. When traffic decreases, unused servers are removed.

This process is typically driven by metrics such as CPU utilization, memory usage, or request rate. Thresholds are defined so the system can react quickly to changes in demand.
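A threshold rule of this kind can be written as a small function. The thresholds (70% and 30% CPU), the one-server step, and the minimum and maximum bounds below are arbitrary values chosen for illustration, not recommendations, and production autoscalers typically add cooldown periods that this sketch omits.

```python
def desired_server_count(
    current,
    cpu_percent,
    scale_up_at=70.0,
    scale_down_at=30.0,
    min_servers=1,
    max_servers=10,
):
    """Return the server count a simple threshold rule would ask for."""
    if cpu_percent > scale_up_at:
        target = current + 1      # overloaded: add a server
    elif cpu_percent < scale_down_at:
        target = current - 1      # underused: remove a server
    else:
        target = current          # within the comfortable band
    # Never go below the floor or above the ceiling.
    return max(min_servers, min(max_servers, target))


busy = desired_server_count(current=3, cpu_percent=85.0)    # 4
quiet = desired_server_count(current=3, cpu_percent=20.0)   # 2
steady = desired_server_count(current=3, cpu_percent=50.0)  # 3
```

The gap between the two thresholds matters: if they were close together, the system could add and remove servers repeatedly as the metric hovered around a single value.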

Autoscaling improves both performance and cost efficiency. Systems can handle sudden spikes without manual intervention while avoiding unnecessary infrastructure costs during low usage periods.

Content Delivery Networks (CDNs)

CDNs cache content closer to users to reduce latency and offload traffic from application servers.

[Diagram: a user, two edge servers, and the origin. Content is served from the nearest edge when cached; a cache miss triggers an origin fetch.]

Static content is cached on edge servers located near users.
Requests are served from the nearest location instead of the origin server.
Reduces latency and decreases load on backend infrastructure.

Details

When users access an application, data typically travels from a central server, which may be far away geographically. This distance introduces latency and slows down response times.

CDNs solve this by caching static content—such as images, scripts, and stylesheets—on edge servers distributed around the world. When a user makes a request, it is served from the closest edge location.
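The hit and miss behaviour of a single edge server can be sketched as a cache in front of the origin. The `Origin` and `EdgeCache` classes and the file path are hypothetical, and the sketch leaves out expiry, invalidation, and choosing the nearest edge, which real CDNs handle.

```python
class Origin:
    """Stand-in for the central server that owns the content."""

    def __init__(self, files):
        self.files = files
        self.fetch_count = 0

    def fetch(self, path):
        self.fetch_count += 1
        return self.files[path]


class EdgeCache:
    """One edge location: serves from its cache, falls back to the origin."""

    def __init__(self, origin):
        self.origin = origin
        self._cache = {}

    def get(self, path):
        if path in self._cache:
            return self._cache[path], "hit"
        body = self.origin.fetch(path)  # cache miss: go to the origin
        self._cache[path] = body
        return body, "miss"


origin = Origin({"/logo.png": b"png-bytes"})
edge = EdgeCache(origin)

first = edge.get("/logo.png")    # (b"png-bytes", "miss")
second = edge.get("/logo.png")   # (b"png-bytes", "hit")
# origin.fetch_count == 1: only the first request reached the origin
```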

This significantly reduces the distance data must travel, resulting in faster load times and improved user experience.

CDNs also reduce load on origin servers by handling a large portion of traffic. Services like Cloudflare, AWS CloudFront, and Fastly are commonly used to implement this layer in scalable systems.

Scalable System Architecture

Scalable systems are built as layered architectures where each component handles a specific part of the workload.

[Diagram: normal traffic flow from the user through the load balancer and application servers to the data store; more traffic creates denser flow and triggers scaling.]

Traffic flows through multiple layers, each designed to handle scale efficiently.
Load is distributed across application servers and databases.
Each layer can scale independently based on system demand.

Details

A scalable system is not a single component but a combination of layers working together to handle increasing demand. Each layer is responsible for a specific function in the request flow.

Users first interact with a CDN, which serves cached content and reduces latency. Requests that require dynamic processing are forwarded to a load balancer.

The load balancer distributes traffic across multiple application servers, allowing the system to process many requests in parallel.

Application servers interact with a distributed database, where data is partitioned or replicated to handle large datasets and high query volumes.
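Partitioning can be illustrated by hashing each key to one of several shards. This is a simplified sketch: the shard count, the `shard_for` helper, and the use of plain dictionaries as shards are assumptions for illustration, and modulo hashing like this forces most keys to move when the shard count changes, which is why real systems often use other schemes.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Four in-memory dicts stand in for four database partitions.
shards = [dict() for _ in range(4)]


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    return shards[shard_for(key, len(shards))].get(key)


put("user:1", {"name": "Ada"})
put("user:2", {"name": "Lin"})
found = get("user:1")
missing = get("user:999")
```

Each read or write touches exactly one shard, so adding shards spreads both storage and query load.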

This layered approach allows each part of the system to scale independently, making it possible to handle growth in users, traffic, and data without redesigning the entire system.

