> [!CAUTION]
> In any distributed system with replication, you need nodes to agree on things: who the leader is, what the committed state is, and in what order events happened. Getting this wrong leads to split-brain: two nodes both think they are the leader, accept conflicting writes, and corrupt your data. Consensus algorithms like Raft and Paxos prevent this.
## The Split-Brain Problem
Imagine a database cluster with a leader and two followers. A network partition occurs, isolating the leader from one follower:
```
[Leader] ←──X──→ [Follower B]
    │                 │
    ▼                 ▼
[Follower A]      (isolated)
```
If Follower B cannot reach the Leader, it might conclude the leader is dead and promote itself to leader. Now you have two leaders accepting writes independently. When the partition heals, the data has diverged — which writes win? This is the split-brain problem, and it has caused real data loss at companies including GitHub (2012 MySQL incident).
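Majority quorums are the standard defense: a node may only act as leader while a strict majority of the cluster backs it, and two disjoint majorities cannot exist. A tiny illustrative Go check (the `quorum` helper is invented for this sketch, not from any library):

```go
package main

import "fmt"

// quorum reports whether a partition holding `got` of `total` nodes
// can elect a leader: it must hold a strict majority.
func quorum(got, total int) bool { return got > total/2 }

func main() {
	total := 5
	// A network partition splits the cluster into sides of 2 and 3.
	for _, side := range []int{2, 3} {
		fmt.Printf("side with %d/%d nodes can elect a leader: %v\n",
			side, total, quorum(side, total))
	}
	// Two disjoint sides can never both hold a majority, so at most
	// one leader can exist at a time: no split-brain.
}
```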
## Raft Consensus Algorithm
Raft was designed in 2013 by Diego Ongaro and John Ousterhout as an understandable alternative to Paxos. It decomposes consensus into three sub-problems: leader election, log replication, and safety.
### Node States
Every node in a Raft cluster is in one of three states at any time:
- Leader: Handles all client requests, replicates log entries to followers.
- Follower: Passive; responds to RPCs from leaders and candidates but issues none of its own.
- Candidate: A transitional state; a follower that has timed out and is campaigning to become the new leader.
### Step-by-Step: Leader Election
1. All nodes start as Followers.
2. Each follower has a randomized election timeout (e.g., 150–300 ms). If a follower does not hear from a leader before its timeout expires, it becomes a Candidate.
3. The Candidate increments its term number (a logical clock) and sends RequestVote RPCs to all other nodes.
4. Each node votes for at most one candidate per term (first-come, first-served).
5. If the Candidate receives votes from a majority (a quorum), it becomes the new Leader.
6. The new Leader immediately sends heartbeat messages to all followers to assert authority and prevent new elections.
```
Term 1: [A:Leader] ──heartbeat──→ [B:Follower] [C:Follower]

        (A crashes)

Term 2: [A:down]  [B:Candidate] ──RequestVote──→ [C:Follower]
        [B] gets vote from C (majority: 2/3)
        [B:Leader] ──heartbeat──→ [C:Follower]
```
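The voting rule itself is small enough to sketch. Below is a minimal, illustrative Go version; the `Node` and `VoteRequest` types are invented for this example, and real Raft additionally rejects candidates whose log is less up-to-date than the voter's:

```go
package main

import "fmt"

// VoteRequest is a simplified stand-in for Raft's RequestVote RPC
// (real Raft also carries the candidate's last log index and term).
type VoteRequest struct {
	Term        int
	CandidateID string
}

// Node holds the minimal per-node election state.
type Node struct {
	ID          string
	CurrentTerm int
	VotedFor    string // "" means "not voted in this term"
}

// HandleVoteRequest grants at most one vote per term, first-come,
// first-served.
func (n *Node) HandleVoteRequest(req VoteRequest) bool {
	if req.Term < n.CurrentTerm {
		return false // stale candidate
	}
	if req.Term > n.CurrentTerm {
		n.CurrentTerm = req.Term // newer term observed: reset the vote
		n.VotedFor = ""
	}
	if n.VotedFor == "" || n.VotedFor == req.CandidateID {
		n.VotedFor = req.CandidateID
		return true
	}
	return false // already voted for someone else this term
}

func main() {
	followers := []*Node{{ID: "B"}, {ID: "C"}, {ID: "D"}, {ID: "E"}}
	req := VoteRequest{Term: 4, CandidateID: "A"}

	votes := 1 // the candidate votes for itself
	for _, f := range followers {
		if f.HandleVoteRequest(req) {
			votes++
		}
	}
	clusterSize := len(followers) + 1
	if votes > clusterSize/2 { // strict majority = quorum
		fmt.Printf("A wins the election with %d/%d votes\n", votes, clusterSize)
	}
}
```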
### Log Replication
Once elected, the leader handles all writes:
1. A client sends a write request to the Leader.
2. The Leader appends the entry to its local log.
3. The Leader sends the entry to all Followers via AppendEntries RPCs.
4. When a majority of nodes have written the entry to their logs, the Leader considers it committed and applies it to the state machine.
5. The Leader notifies the client of success.
This ensures that a committed entry survives the failure of any minority of nodes.
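Here is a minimal Go sketch of that commit rule, with invented `Leader`/`Entry` types; real Raft tracks replication via per-follower matchIndex and, for safety, only directly commits entries from the leader's current term:

```go
package main

import "fmt"

// Entry is a single replicated log record (simplified).
type Entry struct {
	Term, Index int
	Cmd         string
}

// Leader tracks, per follower, the highest log index known to be
// replicated there (Raft calls this matchIndex).
type Leader struct {
	Log        []Entry
	MatchIndex map[string]int
	Commit     int // highest committed index (0 = nothing committed)
}

// advanceCommit marks an index committed once a strict majority of
// the cluster (the leader plus acknowledging followers) stores it.
func (l *Leader) advanceCommit(clusterSize int) {
	for idx := l.Commit + 1; idx <= len(l.Log); idx++ {
		acks := 1 // the leader's own copy
		for _, m := range l.MatchIndex {
			if m >= idx {
				acks++
			}
		}
		if acks <= clusterSize/2 {
			break // not yet on a majority; later entries cannot be either
		}
		l.Commit = idx // survives the failure of any minority
	}
}

func main() {
	l := &Leader{
		Log:        []Entry{{Term: 3, Index: 1, Cmd: "x=5"}},
		MatchIndex: map[string]int{"B": 1, "C": 1, "D": 0, "E": 0},
	}
	l.advanceCommit(5)                     // B and C acked: 3/5 copies
	fmt.Println("commit index:", l.Commit) // 1
}
```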
### Why Randomized Timeouts?
If all followers had the same election timeout, they would all become candidates simultaneously and split the vote (no majority). Randomized timeouts ensure that one node times out first, gets elected quickly, and suppresses other elections with its heartbeat.
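The policy is essentially one line of code; an illustrative Go helper:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a random duration in [150ms, 300ms).
// Randomization makes it unlikely that two followers time out at
// the same moment and split the vote.
func electionTimeout() time.Duration {
	return 150*time.Millisecond +
		time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(electionTimeout()) // e.g. 217.413ms
	}
}
```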
## Paxos: The Original (and Harder) Algorithm
Paxos was invented by Leslie Lamport in 1989. It solves the same problem as Raft but is notoriously difficult to understand and implement. Even Lamport's original paper used an analogy of a Greek parliament to explain it.
Paxos has three roles: Proposer (suggests a value), Acceptor (votes on proposals), and Learner (learns the decided value). The core guarantee: once a majority of acceptors agree on a value, that value is decided and can never be changed.
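To make the roles concrete, here is a minimal single-decree sketch of the acceptor side in Go. The types and method names are invented for illustration; a real deployment runs Prepare/Accept against a majority of acceptors, and a proposer must re-propose the highest-numbered accepted value reported back to it:

```go
package main

import "fmt"

// Acceptor implements the two promises at the heart of single-decree
// Paxos: never act on proposals numbered below a promise, and report
// any previously accepted value back to proposers.
type Acceptor struct {
	Promised    int    // highest proposal number promised
	AcceptedNum int    // number of the accepted proposal (0 = none yet)
	AcceptedVal string
}

// Prepare is phase 1: promise to ignore lower-numbered proposals and
// reveal what, if anything, this acceptor has already accepted.
func (a *Acceptor) Prepare(n int) (ok bool, prevNum int, prevVal string) {
	if n <= a.Promised {
		return false, 0, ""
	}
	a.Promised = n
	return true, a.AcceptedNum, a.AcceptedVal
}

// Accept is phase 2: accept the value unless a higher-numbered
// Prepare has been promised in the meantime.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.Promised {
		return false
	}
	a.Promised, a.AcceptedNum, a.AcceptedVal = n, n, v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)
	fmt.Println("promised n=1:", ok)                   // true
	fmt.Println("accepted x=5:", acc.Accept(1, "x=5")) // true

	// A later proposer learns the accepted value during Prepare and
	// must re-propose it; that is what makes the decision immutable.
	_, prevNum, prevVal := acc.Prepare(2)
	fmt.Println("already accepted:", prevNum, prevVal) // 1 x=5
}
```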
| Property | Raft | Paxos | ZAB (ZooKeeper) |
|---|---|---|---|
| Understandability | Designed to be easy | Notoriously hard | Moderate |
| Leader required? | Yes (strong leader) | No (leaderless Paxos exists) | Yes |
| Notable users | etcd, CockroachDB, Consul, TiKV | Google Chubby, Spanner | Apache ZooKeeper, Kafka (legacy) |
| Ordering guarantee | Total order via log index | Per-slot agreement | Total order via zxid |
## Real-World Usage
### etcd & Kubernetes
Kubernetes stores all cluster state (pods, services, deployments, configs) in etcd, a distributed key-value store that uses Raft. When you run `kubectl apply`, your config is written to the etcd leader, which replicates it to followers. If the etcd leader crashes, Raft automatically elects a new leader. This is why Kubernetes recommends running 3 or 5 etcd nodes: a 3-node cluster keeps quorum (2 of 3) through one failure, and a 5-node cluster (3 of 5) through two.
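To see the quorum from the client side, here is a minimal sketch using etcd's official Go client, go.etcd.io/etcd/client/v3. The endpoint address and keys are assumptions for a local cluster; a Put returns only after a Raft quorum has committed the entry:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is an assumption: a local etcd member on the default port.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The write is forwarded to the Raft leader and acknowledged only
	// after a quorum of members has persisted the log entry.
	if _, err := cli.Put(ctx, "/demo/config", "replicas=3"); err != nil {
		log.Fatal(err)
	}

	resp, err := cli.Get(ctx, "/demo/config")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```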
### CockroachDB
CockroachDB is a distributed SQL database that uses Raft for consensus on every data range. Each range (a partition of the keyspace) has its own Raft group with a leader and followers. This gives CockroachDB strong consistency (serializable isolation) across a globally distributed cluster — something MySQL and PostgreSQL alone cannot do.
### Kafka KRaft
Apache Kafka historically depended on ZooKeeper for leader election and metadata management. Since Kafka 3.x, the project has been replacing ZooKeeper with KRaft, a built-in Raft implementation. This removes an entire operational dependency and simplifies Kafka cluster management.
## Raft Step-by-Step: A Complete Example
Let's trace through a complete Raft scenario with 5 nodes:
Initial state: Node A is the leader, term=3.

1. Client sends write "x=5" to Node A (leader).
2. Node A appends "x=5" to its log at index 7.
3. Node A sends AppendEntries RPCs to B, C, D, E.
4. Nodes B, C, D respond with success (E is slow).
5. Node A has 4/5 confirmations (majority!) → commits the entry.
6. Node A responds to the client: "success".
7. Node A notifies followers of the committed index.
8. Followers apply "x=5" to their state machines.

--- Leader failure ---

9. Node A crashes. Nodes B–E stop receiving heartbeats.
10. After a random timeout (150–300 ms), Node C starts an election.
11. Node C increments the term to 4 and votes for itself.
12. Node C requests votes from B, D, E.
13. B and D grant their votes (C has the most up-to-date log).
14. C has 3/5 votes → becomes the new leader for term 4.
15. C starts sending heartbeats to B, D, E.
16. When A recovers, it sees term=4 and steps down to follower.
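Step 16 follows from a rule every Raft node applies to every RPC it sends or receives: a higher term always wins. A minimal illustrative Go sketch (types invented for this example):

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Node struct {
	State       State
	CurrentTerm int
}

// observeTerm applies Raft's universal rule: any message carrying a
// higher term than our own proves we are stale, so we adopt the term
// and fall back to Follower.
func (n *Node) observeTerm(rpcTerm int) {
	if rpcTerm > n.CurrentTerm {
		n.CurrentTerm = rpcTerm
		n.State = Follower
	}
}

func main() {
	a := &Node{State: Leader, CurrentTerm: 3} // old leader A recovers
	a.observeTerm(4)                          // sees C's term-4 heartbeat
	fmt.Println(a.State == Follower, a.CurrentTerm) // true 4
}
```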
## Paxos vs Raft: Detailed Comparison
| Aspect | Raft | Paxos | Winner |
|---|---|---|---|
| Understandability | Designed for clarity — 3 subproblems | Notoriously hard to understand | Raft |
| Leader requirement | Requires a stable leader | Leaderless (Multi-Paxos adds leader) | Paxos (flexibility) |
| Implementation | etcd, Consul, CockroachDB, TiKV | Google Chubby, Spanner | Tie |
| Throughput | Limited by leader bottleneck | Can parallelize proposals | Paxos |
| Latency | 1 RTT per commit in steady state (leader to majority) | 2 RTTs (prepare + accept); 1 RTT in Multi-Paxos steady state | Tie (steady state) |
| Industry adoption | Widely adopted, many open-source libs | Mostly Google internally | Raft |
## ZooKeeper and ZAB Protocol
Apache ZooKeeper uses its own consensus protocol, ZAB (ZooKeeper Atomic Broadcast), which predates Raft but is similar in spirit: a single elected leader broadcasts totally ordered state updates. Key systems that depend on ZooKeeper:
- Kafka (legacy): Broker metadata, topic configs, controller election. Being replaced by KRaft.
- HBase: Region server tracking, master election.
- Hadoop HDFS: NameNode high-availability with automatic failover.
- Apache Solr: Leader election for SolrCloud.
## Byzantine Fault Tolerance (BFT)
Raft and Paxos assume nodes are honest but may crash (crash-fault tolerance). Byzantine faults occur when nodes can lie, cheat, or act maliciously. BFT protocols (like PBFT, used in some blockchains) can tolerate up to f malicious nodes out of 3f+1 total, but at a significant performance cost.
| Fault Model | Tolerated Failures | Examples |
|---|---|---|
| Crash Fault Tolerance (CFT) | f failures out of 2f+1 nodes | Raft, Paxos, ZAB |
| Byzantine Fault Tolerance (BFT) | f failures out of 3f+1 nodes | PBFT, Tendermint |
For internal distributed systems (databases, queues), CFT (Raft/Paxos) is sufficient. BFT is only needed when you can't trust the participants (e.g., public blockchains).
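The cluster-size arithmetic behind the table, as an illustrative Go helper:

```go
package main

import "fmt"

// Minimum cluster sizes that tolerate f failures under each fault model.
func minNodesCFT(f int) int { return 2*f + 1 } // crash faults: Raft, Paxos, ZAB
func minNodesBFT(f int) int { return 3*f + 1 } // Byzantine faults: PBFT, Tendermint

func main() {
	for f := 1; f <= 2; f++ {
		fmt.Printf("tolerate f=%d: CFT needs %d nodes, BFT needs %d\n",
			f, minNodesCFT(f), minNodesBFT(f))
	}
}
```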
## Common Mistakes
- ❌ Running an even number of consensus nodes — 4 nodes tolerate 1 failure (same as 3 nodes!). Always use odd numbers: 3, 5, or 7.
- ❌ Implementing your own consensus algorithm — the bugs are subtle and catastrophic. Use battle-tested implementations (etcd, ZooKeeper, Consul).
- ❌ Confusing consensus with replication — replication copies data; consensus ensures agreement on the order and content of that data.
- ❌ Not tuning election timeouts — too short causes unnecessary elections (thrashing); too long means slow failover. Typical Raft range: 150-300ms.
- ❌ Assuming consensus is free — every write requires a majority quorum round-trip. High-throughput systems shard data into many consensus groups (e.g., CockroachDB ranges).
> [!TIP]
> Key Takeaways:
> - Split-brain = two leaders = data corruption. Consensus prevents this.
> - Raft: leader election via randomized timeouts + majority vote; log replication via quorum.
> - Paxos: older, harder, but used by Google (Chubby, Spanner).
> - Always run odd-numbered clusters (3, 5) for fault tolerance.
> - Used by etcd (Kubernetes), CockroachDB, Kafka KRaft, Consul, TiKV.