> [!CAUTION]
> In any distributed system with replication, you need nodes to agree on things: who the leader is, what the committed state is, and in what order events happened. Getting this wrong leads to split-brain: two nodes both think they are the leader, accept conflicting writes, and corrupt your data. Consensus algorithms like Raft and Paxos prevent this.
## The Split-Brain Problem
Imagine a database cluster with a leader and two followers. A network partition occurs, isolating the leader from one follower:
```
[Leader] ←──X──→ [Follower B]
    │                 │
    ▼                 ▼
[Follower A]      (isolated)
```
If Follower B cannot reach the Leader, it might conclude the leader is dead and promote itself to leader. Now you have two leaders accepting writes independently. When the partition heals, the data has diverged — which writes win? This is the split-brain problem, and it has caused real data loss at companies including GitHub (2012 MySQL incident).
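Majority quorums are the standard defense: a node may only act as leader while a strict majority of the cluster backs it, and two disjoint majorities cannot exist. A tiny illustrative Go check (the `quorum` helper is invented for this sketch, not from any library):

```go
package main

import "fmt"

// quorum reports whether a partition holding `got` of `total` nodes
// can elect a leader: it must hold a strict majority.
func quorum(got, total int) bool { return got > total/2 }

func main() {
	total := 5
	// A network partition splits the cluster into sides of 2 and 3.
	for _, side := range []int{2, 3} {
		fmt.Printf("side with %d/%d nodes can elect a leader: %v\n",
			side, total, quorum(side, total))
	}
	// Two disjoint sides can never both hold a majority, so at most
	// one leader can exist at a time: no split-brain.
}
```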
## Raft Consensus Algorithm
Raft was designed in 2013 by Diego Ongaro and John Ousterhout as an understandable alternative to Paxos. It decomposes consensus into three sub-problems: leader election, log replication, and safety.
### Node States
Every node in a Raft cluster is in one of three states at any time:
- Leader: Handles all client requests, replicates log entries to followers.
- Follower: Passive; responds to RPCs from leaders and candidates but issues none of its own.
- Candidate: A transitional state; a follower that has timed out and is campaigning to become the new leader.
### Step-by-Step: Leader Election
1. All nodes start as Followers.
2. Each follower has a randomized election timeout (e.g., 150–300 ms). If a follower does not hear from a leader before its timeout expires, it becomes a Candidate.
3. The Candidate increments its term number (a logical clock) and sends RequestVote RPCs to all other nodes.
4. Each node votes for at most one candidate per term (first-come, first-served).
5. If the Candidate receives votes from a majority (a quorum), it becomes the new Leader.
6. The new Leader immediately sends heartbeat messages to all followers to assert authority and prevent new elections.
```
Term 1: [A:Leader] ──heartbeat──→ [B:Follower] [C:Follower]

        (A crashes)

Term 2: [A:down]  [B:Candidate] ──RequestVote──→ [C:Follower]
        [B] gets vote from C (majority: 2/3)
        [B:Leader] ──heartbeat──→ [C:Follower]
```
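The voting rule itself is small enough to sketch. Below is a minimal, illustrative Go version; the `Node` and `VoteRequest` types are invented for this example, and real Raft additionally rejects candidates whose log is less up-to-date than the voter's:

```go
package main

import "fmt"

// VoteRequest is a simplified stand-in for Raft's RequestVote RPC
// (real Raft also carries the candidate's last log index and term).
type VoteRequest struct {
	Term        int
	CandidateID string
}

// Node holds the minimal per-node election state.
type Node struct {
	ID          string
	CurrentTerm int
	VotedFor    string // "" means "not voted in this term"
}

// HandleVoteRequest grants at most one vote per term, first-come,
// first-served.
func (n *Node) HandleVoteRequest(req VoteRequest) bool {
	if req.Term < n.CurrentTerm {
		return false // stale candidate
	}
	if req.Term > n.CurrentTerm {
		n.CurrentTerm = req.Term // newer term observed: reset the vote
		n.VotedFor = ""
	}
	if n.VotedFor == "" || n.VotedFor == req.CandidateID {
		n.VotedFor = req.CandidateID
		return true
	}
	return false // already voted for someone else this term
}

func main() {
	followers := []*Node{{ID: "B"}, {ID: "C"}, {ID: "D"}, {ID: "E"}}
	req := VoteRequest{Term: 4, CandidateID: "A"}

	votes := 1 // the candidate votes for itself
	for _, f := range followers {
		if f.HandleVoteRequest(req) {
			votes++
		}
	}
	clusterSize := len(followers) + 1
	if votes > clusterSize/2 { // strict majority = quorum
		fmt.Printf("A wins the election with %d/%d votes\n", votes, clusterSize)
	}
}
```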
### Log Replication
Once elected, the leader handles all writes:
1. A client sends a write request to the Leader.
2. The Leader appends the entry to its local log.
3. The Leader sends the entry to all Followers via AppendEntries RPCs.
4. When a majority of nodes have written the entry to their logs, the Leader considers it committed and applies it to the state machine.
5. The Leader notifies the client of success.
This ensures that a committed entry survives the failure of any minority of nodes.
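Here is a minimal Go sketch of that commit rule, with invented `Leader`/`Entry` types; real Raft tracks replication via per-follower matchIndex and, for safety, only directly commits entries from the leader's current term:

```go
package main

import "fmt"

// Entry is a single replicated log record (simplified).
type Entry struct {
	Term, Index int
	Cmd         string
}

// Leader tracks, per follower, the highest log index known to be
// replicated there (Raft calls this matchIndex).
type Leader struct {
	Log        []Entry
	MatchIndex map[string]int
	Commit     int // highest committed index (0 = nothing committed)
}

// advanceCommit marks an index committed once a strict majority of
// the cluster (the leader plus acknowledging followers) stores it.
func (l *Leader) advanceCommit(clusterSize int) {
	for idx := l.Commit + 1; idx <= len(l.Log); idx++ {
		acks := 1 // the leader's own copy
		for _, m := range l.MatchIndex {
			if m >= idx {
				acks++
			}
		}
		if acks <= clusterSize/2 {
			break // not yet on a majority; later entries cannot be either
		}
		l.Commit = idx // survives the failure of any minority
	}
}

func main() {
	l := &Leader{
		Log:        []Entry{{Term: 3, Index: 1, Cmd: "x=5"}},
		MatchIndex: map[string]int{"B": 1, "C": 1, "D": 0, "E": 0},
	}
	l.advanceCommit(5)                     // B and C acked: 3/5 copies
	fmt.Println("commit index:", l.Commit) // 1
}
```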
### Why Randomized Timeouts?
If all followers had the same election timeout, they would all become candidates simultaneously and split the vote (no majority). Randomized timeouts ensure that one node times out first, gets elected quickly, and suppresses other elections with its heartbeat.
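The policy is essentially one line of code; an illustrative Go helper:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a random duration in [150ms, 300ms).
// Randomization makes it unlikely that two followers time out at
// the same moment and split the vote.
func electionTimeout() time.Duration {
	return 150*time.Millisecond +
		time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(electionTimeout()) // e.g. 217.413ms
	}
}
```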
## Paxos: The Original (and Harder) Algorithm
Paxos was invented by Leslie Lamport in 1989. It solves the same problem as Raft but is notoriously difficult to understand and implement. Even Lamport's original paper used an analogy of a Greek parliament to explain it.
Paxos has three roles: Proposer (suggests a value), Acceptor (votes on proposals), and Learner (learns the decided value). The core guarantee: once a majority of acceptors agree on a value, that value is decided and can never be changed.
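To make the roles concrete, here is a minimal single-decree sketch of the acceptor side in Go. The types and method names are invented for illustration; a real deployment runs Prepare/Accept against a majority of acceptors, and a proposer must re-propose the highest-numbered accepted value reported back to it:

```go
package main

import "fmt"

// Acceptor implements the two promises at the heart of single-decree
// Paxos: never act on proposals numbered below a promise, and report
// any previously accepted value back to proposers.
type Acceptor struct {
	Promised    int    // highest proposal number promised
	AcceptedNum int    // number of the accepted proposal (0 = none yet)
	AcceptedVal string
}

// Prepare is phase 1: promise to ignore lower-numbered proposals and
// reveal what, if anything, this acceptor has already accepted.
func (a *Acceptor) Prepare(n int) (ok bool, prevNum int, prevVal string) {
	if n <= a.Promised {
		return false, 0, ""
	}
	a.Promised = n
	return true, a.AcceptedNum, a.AcceptedVal
}

// Accept is phase 2: accept the value unless a higher-numbered
// Prepare has been promised in the meantime.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.Promised {
		return false
	}
	a.Promised, a.AcceptedNum, a.AcceptedVal = n, n, v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)
	fmt.Println("promised n=1:", ok)                   // true
	fmt.Println("accepted x=5:", acc.Accept(1, "x=5")) // true

	// A later proposer learns the accepted value during Prepare and
	// must re-propose it; that is what makes the decision immutable.
	_, prevNum, prevVal := acc.Prepare(2)
	fmt.Println("already accepted:", prevNum, prevVal) // 1 x=5
}
```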
| Property | Raft | Paxos | ZAB (ZooKeeper) |
|---|---|---|---|
| Understandability | Designed to be easy | Notoriously hard | Moderate |
| Leader required? | Yes (strong leader) | No (leaderless Paxos exists) | Yes |
| Notable users | etcd, CockroachDB, Consul, TiKV | Google Chubby, Spanner | Apache ZooKeeper, Kafka (legacy) |
| Ordering guarantee | Total order via log index | Per-slot agreement | Total order via zxid |
## Real-World Usage
### etcd & Kubernetes
Kubernetes stores all cluster state (pods, services, deployments, configs) in etcd, a distributed key-value store that uses Raft. When you run `kubectl apply`, your config is written to the etcd leader, which replicates it to followers. If the etcd leader crashes, Raft automatically elects a new leader. This is why Kubernetes recommends running 3 or 5 etcd nodes: a 3-node cluster keeps quorum (2 of 3) through one failure, and a 5-node cluster (3 of 5) through two.
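To see the quorum from the client side, here is a minimal sketch using etcd's official Go client, go.etcd.io/etcd/client/v3. The endpoint address and keys are assumptions for a local cluster; a Put returns only after a Raft quorum has committed the entry:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is an assumption: a local etcd member on the default port.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The write is forwarded to the Raft leader and acknowledged only
	// after a quorum of members has persisted the log entry.
	if _, err := cli.Put(ctx, "/demo/config", "replicas=3"); err != nil {
		log.Fatal(err)
	}

	resp, err := cli.Get(ctx, "/demo/config")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```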
### CockroachDB
CockroachDB is a distributed SQL database that uses Raft for consensus on every data range. Each range (a partition of the keyspace) has its own Raft group with a leader and followers. This gives CockroachDB strong consistency (serializable isolation) across a globally distributed cluster — something MySQL and PostgreSQL alone cannot do.
### Kafka KRaft
Apache Kafka historically depended on ZooKeeper for leader election and metadata management. Since Kafka 3.x, the project has been replacing ZooKeeper with KRaft, a built-in Raft implementation. This removes an entire operational dependency and simplifies Kafka cluster management.
## Raft Step-by-Step: A Complete Example
Let's trace through a complete Raft scenario with 5 nodes:
Initial state: Node A is the leader, term=3.

1. Client sends write "x=5" to Node A (leader).
2. Node A appends "x=5" to its log at index 7.
3. Node A sends AppendEntries RPCs to B, C, D, E.
4. Nodes B, C, D respond with success (E is slow).
5. Node A has 4/5 confirmations (majority!) → commits the entry.
6. Node A responds to the client: "success".
7. Node A notifies followers of the committed index.
8. Followers apply "x=5" to their state machines.

--- Leader failure ---

9. Node A crashes. Nodes B–E stop receiving heartbeats.
10. After a random timeout (150–300 ms), Node C starts an election.
11. Node C increments the term to 4 and votes for itself.
12. Node C requests votes from B, D, E.
13. B and D grant their votes (C has the most up-to-date log).
14. C has 3/5 votes → becomes the new leader for term 4.
15. C starts sending heartbeats to B, D, E.
16. When A recovers, it sees term=4 and steps down to follower.
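Step 16 follows from a rule every Raft node applies to every RPC it sends or receives: a higher term always wins. A minimal illustrative Go sketch (types invented for this example):

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Node struct {
	State       State
	CurrentTerm int
}

// observeTerm applies Raft's universal rule: any message carrying a
// higher term than our own proves we are stale, so we adopt the term
// and fall back to Follower.
func (n *Node) observeTerm(rpcTerm int) {
	if rpcTerm > n.CurrentTerm {
		n.CurrentTerm = rpcTerm
		n.State = Follower
	}
}

func main() {
	a := &Node{State: Leader, CurrentTerm: 3} // old leader A recovers
	a.observeTerm(4)                          // sees C's term-4 heartbeat
	fmt.Println(a.State == Follower, a.CurrentTerm) // true 4
}
```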
## Paxos vs Raft: Detailed Comparison
| Aspect | Raft | Paxos | Winner |
|---|---|---|---|
| Understandability | Designed for clarity — 3 subproblems | Notoriously hard to understand | Raft |
| Leader requirement | Requires a stable leader | Leaderless (Multi-Paxos adds leader) | Paxos (flexibility) |
| Implementation | etcd, Consul, CockroachDB, TiKV | Google Chubby, Spanner | Tie |
| Throughput | Limited by leader bottleneck | Can parallelize proposals | Paxos |
| Latency | 1 RTT per commit in steady state (leader to majority) | 2 RTTs (prepare + accept); 1 RTT in Multi-Paxos steady state | Tie (steady state) |
| Industry adoption | Widely adopted, many open-source libs | Mostly Google internally | Raft |
## ZooKeeper and ZAB Protocol
Apache ZooKeeper uses its own consensus protocol, ZAB (ZooKeeper Atomic Broadcast), which predates Raft but is similar in spirit: a single elected leader broadcasts totally ordered state updates. Key systems that depend on ZooKeeper:
- Kafka (legacy): Broker metadata, topic configs, controller election. Being replaced by KRaft.
- HBase: Region server tracking, master election.
- Hadoop HDFS: NameNode high-availability with automatic failover.
- Apache Solr: Leader election for SolrCloud.
## Byzantine Fault Tolerance (BFT)
Raft and Paxos assume nodes are honest but may crash (crash-fault tolerance). Byzantine faults occur when nodes can lie, cheat, or act maliciously. BFT protocols (like PBFT, used in some blockchains) can tolerate up to f malicious nodes out of 3f+1 total, but at a significant performance cost.
| Fault Model | Tolerated Failures | Examples |
|---|---|---|
| Crash Fault Tolerance (CFT) | f failures out of 2f+1 nodes | Raft, Paxos, ZAB |
| Byzantine Fault Tolerance (BFT) | f failures out of 3f+1 nodes | PBFT, Tendermint |
For internal distributed systems (databases, queues), CFT (Raft/Paxos) is sufficient. BFT is only needed when you can't trust the participants (e.g., public blockchains).
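The cluster-size arithmetic behind the table, as an illustrative Go helper:

```go
package main

import "fmt"

// Minimum cluster sizes that tolerate f failures under each fault model.
func minNodesCFT(f int) int { return 2*f + 1 } // crash faults: Raft, Paxos, ZAB
func minNodesBFT(f int) int { return 3*f + 1 } // Byzantine faults: PBFT, Tendermint

func main() {
	for f := 1; f <= 2; f++ {
		fmt.Printf("tolerate f=%d: CFT needs %d nodes, BFT needs %d\n",
			f, minNodesCFT(f), minNodesBFT(f))
	}
}
```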
## Common Mistakes
- ❌ Running an even number of consensus nodes — 4 nodes tolerate 1 failure (same as 3 nodes!). Always use odd numbers: 3, 5, or 7.
- ❌ Implementing your own consensus algorithm — the bugs are subtle and catastrophic. Use battle-tested implementations (etcd, ZooKeeper, Consul).
- ❌ Confusing consensus with replication — replication copies data; consensus ensures agreement on the order and content of that data.
- ❌ Not tuning election timeouts — too short causes unnecessary elections (thrashing); too long means slow failover. Typical Raft range: 150-300ms.
- ❌ Assuming consensus is free — every write requires a majority quorum round-trip. High-throughput systems shard data into many consensus groups (e.g., CockroachDB ranges).
> [!TIP]
> Key Takeaways:
> - Split-brain = two leaders = data corruption. Consensus prevents this.
> - Raft: leader election via randomized timeouts + majority vote; log replication via quorum.
> - Paxos: older, harder, but used by Google (Chubby, Spanner).
> - Always run odd-numbered clusters (3, 5) for fault tolerance.
> - Used by etcd (Kubernetes), CockroachDB, Kafka KRaft, Consul, TiKV.