Distributed Systems Fundamentals
Understand core distributed systems concepts like coordination, consistency, fault tolerance, and trade-offs.
Why Systems Become Distributed
A single server works at small scale, but real systems outgrow it due to traffic, data size, and reliability demands.
- One machine has hard limits in CPU, memory, storage, and network throughput.
- Growing data and traffic create performance bottlenecks that vertical scaling cannot sustain.
- Single-server setups introduce a single point of failure, making high availability impossible.
Details
Most systems begin with a single server because it is simple to build and manage. At low scale, one machine can handle requests, execute logic, and store data without much complexity.
As usage grows, this model breaks down. Increased traffic leads to higher latency and resource exhaustion, while larger datasets slow down queries and strain storage systems.
Reliability also becomes a major issue. If that one server fails, the entire system goes offline. This is unacceptable for systems that require continuous availability.
To overcome these limits, systems distribute work across multiple machines. Instead of relying on one server, workloads are shared, enabling scalability and fault tolerance—but introducing new challenges in coordination and consistency.
What Is a Distributed System
A distributed system is multiple independent machines that coordinate over a network to behave like a single system.
- Each machine (node) operates independently with its own memory and resources.
- Nodes communicate through network calls instead of shared memory.
- The system is designed to appear as one unified service to users.
Details
In a distributed system, multiple servers work together to handle requests, store data, and execute logic. Each server is a separate machine, often running in different locations, but collectively they form a single system from the user’s perspective.
Unlike a single-machine system, these nodes do not share memory. Instead, they communicate by sending messages over a network, which introduces latency and the possibility of failure.
Because of this, coordination becomes essential. Nodes must agree on shared state, such as data values or system decisions, even though they are physically separated.
The key challenge is making this group of independent machines behave consistently and reliably as one system, despite delays, failures, and partial visibility across the network.
Replication
Replication duplicates data across multiple machines to improve availability, fault tolerance, and read performance.
- Data is copied from a primary node to one or more replica nodes.
- If one node fails, other replicas can continue serving requests.
- Read requests can be distributed across replicas to reduce load.
Details
Replication is a core technique in distributed systems where the same data is stored on multiple machines. Typically, one node acts as the primary, handling writes, while other nodes act as replicas that receive copies of the data.
This setup improves reliability. If the primary or any replica fails, the system can continue operating using the remaining nodes, reducing downtime.
Replication also improves performance, especially for read-heavy systems. Instead of sending all read requests to a single machine, requests can be distributed across multiple replicas.
However, replication introduces complexity. Updates must be propagated across nodes, and delays in synchronization can lead to temporary inconsistencies between replicas.
Partition Tolerance
Partition tolerance means a system continues operating even when parts of the network cannot communicate.
- Network failures can split systems into isolated groups of nodes.
- Nodes may continue running but cannot communicate with each other.
- Systems must handle these failures without completely shutting down.
Details
In distributed systems, machines communicate over networks, and networks are inherently unreliable. Connections can drop, messages can be delayed, and entire segments of the system can become unreachable.
A network partition occurs when nodes are split into groups that cannot communicate with each other. For example, Server A may be unable to reach Server B due to a network failure, even though both are still running.
This creates a fundamental challenge: each side of the partition may continue processing requests independently, potentially leading to conflicting states.
Partition tolerance means the system is designed to keep operating despite these failures, accepting that coordination between nodes may be temporarily impossible.
Consistency in Distributed Systems
Consistency defines whether all nodes in a distributed system see the same data at the same time.
- Strong consistency ensures every read returns the latest write across all nodes.
- Eventual consistency allows temporary differences but guarantees convergence over time.
- Higher consistency often comes at the cost of availability or performance.
Details
In distributed systems, data is often replicated across multiple machines, which introduces the problem of keeping that data in sync. Consistency defines how aligned those replicas are at any given moment.
With strong consistency, once a write occurs, all future reads—no matter which node they hit—return the updated value immediately. This requires tight coordination between nodes.
With eventual consistency, updates propagate over time. After a write, some nodes may temporarily return stale data, but all nodes will eventually converge to the same state.
The tradeoff is unavoidable. Systems that enforce strict consistency often sacrifice performance or availability, especially under network delays or failures.
CAP Theorem
The CAP theorem states that a distributed system can only guarantee two of three properties: consistency, availability, and partition tolerance.
- Consistency means all nodes return the same, up-to-date data.
- Availability means every request receives a response, even during failures.
- Partition tolerance means the system continues operating despite network splits.
Details
The CAP theorem captures a hard constraint in distributed systems: when a network partition happens, you cannot maintain both perfect consistency and full availability at the same time.
Consistency requires coordination between nodes. For every read to return the latest write, nodes must agree on the current state, which can require waiting for communication across the network.
Availability, on the other hand, prioritizes responsiveness. The system continues to serve requests even if some nodes are unreachable, which may result in returning stale or inconsistent data.
Partition tolerance is unavoidable because network failures will happen in any real system. This forces systems into a tradeoff: during a partition, they must either delay responses to preserve consistency or respond immediately and accept temporary inconsistency.
Distributed Coordination
Distributed coordination ensures multiple nodes agree on shared decisions and system state.
- Nodes must coordinate to avoid conflicts and inconsistent system behavior.
- Common patterns include leader election, distributed locks, and consensus.
- Specialized systems like ZooKeeper, etcd, and Consul handle coordination reliably.
Details
In distributed systems, multiple nodes operate independently, but many operations require them to act in a coordinated way. Without coordination, nodes could make conflicting decisions, leading to inconsistent data or system failures.
Leader election is a common pattern where one node is chosen to make decisions or handle critical tasks. This prevents multiple nodes from performing the same action simultaneously.
Distributed locks ensure that only one node can access or modify a shared resource at a time, similar to locks in single-machine concurrency but implemented across a network.
Consensus algorithms go further by allowing nodes to agree on a value or decision, even in the presence of failures. Tools like ZooKeeper, etcd, and Consul provide these coordination mechanisms so engineers do not have to implement them from scratch.
Challenges of Distributed Systems
Distributed systems introduce failure modes and complexity that do not exist in single-machine systems.
- Network failures and latency make communication unreliable and unpredictable.
- Replicated data can become temporarily inconsistent across nodes.
- Coordinating multiple machines introduces significant design complexity.
Details
Once a system is distributed, it must operate over a network, and networks are inherently unreliable. Messages can be delayed, dropped, or arrive out of order, making simple operations much harder to reason about.
Data consistency also becomes a challenge. With replication, different nodes may hold different versions of the same data at any given moment, especially under heavy load or network delays.
Coordination between nodes adds another layer of difficulty. Ensuring that multiple machines agree on shared state or execute tasks in the correct order requires complex protocols and careful design.
These challenges are not edge cases—they are fundamental. Designing systems that remain correct and reliable despite these issues is the core problem that distributed systems engineering aims to solve.
Question Section
1 / 5
This track is locked
Buy this track once to unlock all of its lessons.