System Design: The Complete Guide

20. Leader Election & Consensus (Raft & Paxos)

How distributed nodes agree on a single leader and a single version of truth, even when things fail.

Mar 5, 2026

[!CAUTION] In any distributed system with replication, you need nodes to agree on things: who is the leader, what is the committed state, what order did events happen in. Getting this wrong leads to split-brain — two nodes both think they are the leader, accept conflicting writes, and corrupt your data. Consensus algorithms like Raft and Paxos prevent this.

The Split-Brain Problem

Imagine a database cluster with a leader and two followers. A network partition occurs, isolating the leader from one follower:

  [Leader]  ←──X──→  [Follower B]
      │                    │
      ▼                    ▼
  [Follower A]        (isolated)

If Follower B cannot reach the Leader, it might conclude the leader is dead and promote itself to leader. Now you have two leaders accepting writes independently. When the partition heals, the data has diverged — which writes win? This is the split-brain problem, and it has caused real data loss at companies including GitHub (2012 MySQL incident).
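The standard defense is to require a majority quorum for leadership. A minimal Python sketch (illustrative only, not any real database's code; `majority` and `can_elect_leader` are made-up names) shows why at most one side of a partition can ever elect a leader:

```python
CLUSTER_SIZE = 3

def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1

def can_elect_leader(partition_size: int, cluster_size: int = CLUSTER_SIZE) -> bool:
    """A partition can elect a leader only if it contains a majority."""
    return partition_size >= majority(cluster_size)

# A partition splits a 3-node cluster into sides of 2 and 1 nodes.
sides = [2, 1]
electable = [can_elect_leader(s) for s in sides]
print(electable)            # [True, False] -> only one side can hold a leader
assert sum(electable) <= 1  # the two sides can never both elect
```

Since the two sides of a partition cannot both contain a majority, at most one of them can elect a leader, which rules out split-brain by construction.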

Raft Consensus Algorithm

Raft was designed in 2013 by Diego Ongaro and John Ousterhout as an understandable alternative to Paxos. It decomposes consensus into three sub-problems: leader election, log replication, and safety.

Node States

Every node in a Raft cluster is in one of three states at any time:

  • Leader: Handles all client requests, replicates log entries to followers.
  • Follower: Passive — only responds to RPCs from the leader.
  • Candidate: Transitional — a follower that is trying to become the new leader.

Step-by-Step: Leader Election

  1. All nodes start as Followers.
  2. Each follower has a randomized election timeout (e.g., 150–300ms). If a follower does not hear from a leader before its timeout expires, it becomes a Candidate.
  3. The Candidate increments its term number (a logical clock) and sends RequestVote RPCs to all other nodes.
  4. Each node votes for at most one candidate per term (first-come-first-served).
  5. If the Candidate receives votes from a majority (quorum), it becomes the new Leader.
  6. The new Leader immediately sends heartbeat messages to all followers to assert authority and prevent new elections.
Term 1: [A:Leader] ──heartbeat──→ [B:Follower] [C:Follower]

  (A crashes)

Term 2: [A:down]   [B:Candidate] ──RequestVote──→ [C:Follower]
                    [B] gets vote from C (majority: 2/3)
                    [B:Leader] ──heartbeat──→ [C:Follower]
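The vote-granting rule in steps 3-5 can be sketched in a few lines of Python. This is a toy model, not a real Raft library: the `Node` class and `request_vote` method are illustrative names, and the full algorithm's log up-to-date check is omitted:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    current_term: int = 0
    voted_for: Optional[str] = None   # at most one vote per term

    def request_vote(self, candidate_id: str, candidate_term: int) -> bool:
        # A stale candidate (older term) is always rejected.
        if candidate_term < self.current_term:
            return False
        # Seeing a newer term moves us into it and clears our vote.
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None
        # Grant at most one vote per term, first come first served.
        if self.voted_for is None:
            self.voted_for = candidate_id
            return True
        return self.voted_for == candidate_id

node = Node()
print(node.request_vote("B", candidate_term=2))  # True  (first request in term 2)
print(node.request_vote("C", candidate_term=2))  # False (already voted for B)
```

The one-vote-per-term rule is what makes two majorities in the same term impossible: every majority shares at least one node, and that node voted only once.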

Log Replication

Once elected, the leader handles all writes:

  1. Client sends a write request to the Leader.
  2. Leader appends the entry to its local log.
  3. Leader sends the entry to all Followers via AppendEntries RPC.
  4. When a majority of nodes have written the entry to their logs, the Leader considers it committed and applies it to the state machine.
  5. Leader notifies the client of success.

This ensures that a committed entry survives the failure of any minority of nodes.
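The commit rule in step 4 is pure quorum arithmetic. A sketch, assuming a Raft-style map of how far each follower has replicated (the `match_index` name mirrors the Raft paper, but this is illustrative, not a real implementation):

```python
def committed_index(match_index: dict, leader_last_index: int, cluster_size: int) -> int:
    """Highest log index stored on a majority of the cluster."""
    # Indexes known to be durable: the leader's own log plus each follower's.
    stored = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    quorum = cluster_size // 2 + 1
    # The quorum-th highest stored index is present on >= quorum nodes.
    return stored[quorum - 1]

# 5-node cluster: leader at index 7; B, C, D have confirmed 7; E lags at 5.
match = {"B": 7, "C": 7, "D": 7, "E": 5}
print(committed_index(match, leader_last_index=7, cluster_size=5))  # 7
```

Because the entry sits on a majority, any future leader's election quorum must overlap that majority, so the committed entry can never be lost.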

Why Randomized Timeouts?

If all followers had the same election timeout, they would all become candidates simultaneously and split the vote (no majority). Randomized timeouts ensure that one node times out first, gets elected quickly, and suppresses other elections with its heartbeat.
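A tiny simulation of this idea (illustrative only): each node draws its own timeout from the 150-300 ms range, so one node almost always fires first and wins the election before the others even become candidates:

```python
import random

random.seed(42)  # deterministic draw for the example

def election_timeout_ms(low: float = 150.0, high: float = 300.0) -> float:
    """Each follower picks its own randomized election timeout."""
    return random.uniform(low, high)

timeouts = {node: election_timeout_ms() for node in "ABC"}
first = min(timeouts, key=timeouts.get)
print(f"{first} times out first and starts the election")
```

With distinct timeouts, the earliest node becomes a candidate, wins a majority, and its heartbeats reset everyone else's timers before they expire.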

Paxos: The Original (and Harder) Algorithm

Paxos was invented by Leslie Lamport in 1989. It solves the same problem as Raft but is notoriously difficult to understand and implement. Even Lamport's original paper, "The Part-Time Parliament," explained the algorithm through an extended analogy to a parliament on the Greek island of Paxos.

Paxos has three roles: Proposer (suggests a value), Acceptor (votes on proposals), and Learner (learns the decided value). The core guarantee: once a majority of acceptors agree on a value, that value is decided and can never be changed.
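The acceptor's side of the protocol can be sketched as a toy single-decree acceptor (illustrative, not a production implementation; the class and method names are made up). It honors the two classic rules: never accept a proposal numbered below a promise, and report any previously accepted value so proposers must re-propose it:

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted_n = -1      # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # Return any previously accepted value so the proposer adopts it.
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n: int, value):
        """Phase 2: accept the proposal unless a higher prepare was seen."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

a = Acceptor()
ok, _, _ = a.prepare(1)
print(ok, a.accept(1, "x=5"))   # True True
print(a.prepare(0)[0])          # False: 0 is below the promised number 1
```

Once a majority of acceptors have accepted a value, any later prepare that reaches a majority will learn of it and is forced to re-propose the same value, which is why a decided value can never change.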

| Property | Raft | Paxos | ZAB (ZooKeeper) |
| --- | --- | --- | --- |
| Understandability | Designed to be easy | Notoriously hard | Moderate |
| Leader required? | Yes (strong leader) | No (leaderless Paxos exists) | Yes |
| Notable users | etcd, CockroachDB, Consul, TiKV | Google Chubby, Spanner | Apache ZooKeeper, Kafka (old) |
| Ordering guarantee | Total order via log index | Per-slot agreement | Total order via zxid |

Real-World Usage

etcd & Kubernetes

Kubernetes stores all cluster state (pods, services, deployments, configs) in etcd, a distributed key-value store that uses Raft. When you run kubectl apply, your config is written to the etcd leader, which replicates it to followers. If the etcd leader crashes, Raft automatically elects a new leader. This is why Kubernetes recommends running 3 or 5 etcd nodes — a quorum of 2/3 or 3/5 survives one or two failures.
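The quorum arithmetic behind the "3 or 5 nodes" advice is easy to verify (a small illustrative calculation, not etcd code). Note in the output that 4 nodes tolerate no more failures than 3:

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Failures a cluster of n nodes can absorb while keeping a quorum."""
    return n - majority(n)

for n in (1, 2, 3, 4, 5):
    print(f"{n} nodes -> quorum {majority(n)} -> tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes tolerate 1 failure; 4 nodes also tolerate only 1; 5 tolerate 2.
```

This is why consensus clusters are sized with odd numbers: the fourth node adds load and coordination cost without adding any fault tolerance.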

CockroachDB

CockroachDB is a distributed SQL database that uses Raft for consensus on every data range. Each range (a partition of the keyspace) has its own Raft group with a leader and followers. This gives CockroachDB strong consistency (serializable isolation) across a globally distributed cluster — something MySQL and PostgreSQL alone cannot do.

Kafka KRaft

Apache Kafka historically depended on ZooKeeper for leader election and metadata management. Since Kafka 3.x, the project has been replacing ZooKeeper with KRaft — a built-in Raft implementation. This removes an entire operational dependency and simplifies Kafka cluster management.

Raft Step-by-Step: A Complete Example

Let's trace through a complete Raft scenario with 5 nodes:

Initial state: Node A is leader, term=3

  1. Client sends write "x=5" to Node A (leader)
  2. Node A appends "x=5" to its log at index 7
  3. Node A sends AppendEntries RPC to B, C, D, E
  4. Nodes B, C, D respond with success (E is slow)
  5. Node A has 4/5 confirmations (majority!) → commits entry
  6. Node A responds to client: "success"
  7. Node A notifies followers of committed index
  8. Followers apply "x=5" to their state machines

--- Leader failure ---

  9. Node A crashes. Nodes B-E stop receiving heartbeats
  10. After random timeout (150-300ms), Node C starts election
  11. Node C increments term to 4, votes for itself
  12. Node C requests votes from B, D, E
  13. B and D grant votes (C has the most up-to-date log)
  14. C has 3/5 votes → becomes new leader for term 4
  15. C starts sending heartbeats to B, D, E
  16. When A recovers, it sees term=4, steps down to follower

Paxos vs Raft: Detailed Comparison

| Aspect | Raft | Paxos | Winner |
| --- | --- | --- | --- |
| Understandability | Designed for clarity (3 subproblems) | Notoriously hard to understand | Raft |
| Leader requirement | Requires a stable leader | Leaderless (Multi-Paxos adds a leader) | Paxos (flexibility) |
| Implementations | etcd, Consul, CockroachDB, TiKV | Google Chubby, Spanner | Tie |
| Throughput | Limited by leader bottleneck | Can parallelize proposals | Paxos |
| Latency | 2 RTTs (1 replication + 1 commit) | 2 RTTs (prepare + accept) | Tie |
| Industry adoption | Widely adopted, many open-source libs | Mostly Google internally | Raft |

ZooKeeper and ZAB Protocol

Apache ZooKeeper reaches consensus with ZAB (ZooKeeper Atomic Broadcast), a leader-based protocol that predates Raft but fills a similar role. Key systems that depend on ZooKeeper:

  • Kafka (legacy): Broker metadata, topic configs, controller election. Being replaced by KRaft.
  • HBase: Region server tracking, master election.
  • Hadoop HDFS: NameNode high-availability with automatic failover.
  • Apache Solr: Leader election for SolrCloud.

Byzantine Fault Tolerance (BFT)

Raft and Paxos assume nodes are honest but might crash (crash-fault tolerance). Byzantine faults are when nodes can lie, cheat, or act maliciously. BFT protocols (like PBFT, used in some blockchains) can tolerate up to f malicious nodes out of 3f+1 total — but at a significant performance cost.

| Fault Model | Tolerated Failures | Examples |
| --- | --- | --- |
| Crash Fault Tolerance (CFT) | f failures out of 2f+1 nodes | Raft, Paxos, ZAB |
| Byzantine Fault Tolerance (BFT) | f failures out of 3f+1 nodes | PBFT, Tendermint |

For internal distributed systems (databases, queues), CFT (Raft/Paxos) is sufficient. BFT is only needed when you can't trust the participants (e.g., public blockchains).
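The cluster sizes implied by the two fault models follow directly from the formulas above (a small illustrative calculation):

```python
def cft_cluster_size(f: int) -> int:
    """Nodes needed to survive f crash failures (Raft, Paxos, ZAB)."""
    return 2 * f + 1

def bft_cluster_size(f: int) -> int:
    """Nodes needed to survive f Byzantine (malicious) nodes (PBFT, Tendermint)."""
    return 3 * f + 1

for f in (1, 2):
    print(f"f={f}: CFT needs {cft_cluster_size(f)} nodes, BFT needs {bft_cluster_size(f)}")
# f=1: CFT needs 3 nodes, BFT needs 4
# f=2: CFT needs 5 nodes, BFT needs 7
```

The extra nodes are the price of distrust: a BFT quorum must stay correct even when f members actively lie, not merely go silent.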

Common Mistakes

  • ❌ Running an even number of consensus nodes — 4 nodes tolerate 1 failure (same as 3 nodes!). Always use odd numbers: 3, 5, or 7.
  • ❌ Implementing your own consensus algorithm — the bugs are subtle and catastrophic. Use battle-tested implementations (etcd, ZooKeeper, Consul).
  • ❌ Confusing consensus with replication — replication copies data; consensus ensures agreement on the order and content of that data.
  • ❌ Not tuning election timeouts — too short causes unnecessary elections (thrashing); too long means slow failover. Typical Raft range: 150-300ms.
  • ❌ Assuming consensus is free — every write requires a majority quorum round-trip. High-throughput systems shard data into many consensus groups (e.g., CockroachDB ranges).

[!TIP] Key Takeaways:
• Split-brain = two leaders = data corruption. Consensus prevents this.
• Raft: leader election via randomized timeouts + majority vote, log replication via quorum.
• Paxos: older, harder, but used by Google (Chubby, Spanner).
• Always run odd-numbered clusters (3, 5) for fault tolerance.
• Used by etcd (Kubernetes), CockroachDB, Kafka KRaft, Consul, TiKV.
