System Design: The Complete Guide

8. The CAP Theorem

Consistency, availability, partitions, and real-world tradeoffs.

Feb 22, 2026

[!CAUTION] CAP is about what happens during a network partition. When the network splits, you must choose: reject requests (consistency) or serve responses that may be stale (availability). This is not a theoretical exercise—it defines how your users experience outages.

The CAP Properties

Consistency (C)

Every read receives the most recent successful write (or returns an error). All nodes see the same data at the same time.

Availability (A)

Every request to a non-failing node gets a response—but the response may not include the latest write.

Partition Tolerance (P)

The system continues operating even if network messages between nodes are dropped or delayed. In real-world distributed systems, partitions are inevitable—cables get cut, switches fail, cloud regions lose connectivity.

Why Partition Tolerance Is Mandatory

If your system spans more than one node (multiple availability zones, regions, or data centers), network partitions will happen. This is not a theoretical possibility; over a long enough time span it's a certainty. Even the major cloud providers (AWS, Google Cloud, Azure) experience connectivity incidents between zones and regions. So the real decision is: during a partition, do you prioritize Consistency or Availability?

CP Systems: Correctness Over Uptime

When the network splits, a CP system rejects some requests rather than risk returning incorrect data.

Real-World Example: Google Spanner

Google Spanner is a globally distributed SQL database that chooses Consistency + Partition Tolerance. It relies on TrueTime, a clock API backed by GPS receivers and atomic clocks in each data center, to order transactions so that a read sees every write committed before it, globally. When a partition occurs, replicas that cannot reach a majority of their replication group stop accepting writes rather than risk diverging. Google runs its advertising backend on Spanner, where inconsistent reads could translate into incorrect billing.

  • Best for: Banking, payments, inventory counts, uniqueness constraints (anything where "wrong" data is worse than "no" data).
  • Trade-off: Some user actions temporarily fail. Clients must handle retries gracefully.
  • Other CP systems: ZooKeeper (historically used by Kafka for controller election; newer Kafka versions use the built-in KRaft mode instead), etcd (used by Kubernetes), HBase.
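
The CP behavior described above can be sketched in a few lines. This is a minimal illustration, not how Spanner or etcd are implemented: `CPNode`, `NoQuorumError`, and `submit` are hypothetical names, and the "network" is reduced to a counter of reachable peers. The node refuses a write whenever it cannot reach a majority, and the client turns that refusal into a retry signal.

```python
class NoQuorumError(Exception):
    """Raised when a node cannot reach a majority of the cluster."""


class CPNode:
    """Hypothetical CP replica: it prefers rejecting a write over diverging."""

    def __init__(self, name, cluster_size):
        self.name = name
        self.cluster_size = cluster_size
        # Healthy network assumed at start: every other node is reachable.
        self.reachable_peers = cluster_size - 1
        self.data = {}

    def has_quorum(self):
        # Count this node plus every peer it can still talk to.
        return (1 + self.reachable_peers) > self.cluster_size // 2

    def write(self, key, value):
        if not self.has_quorum():
            raise NoQuorumError(f"{self.name}: no majority, rejecting write")
        self.data[key] = value
        return "ok"


def submit(node, key, value):
    """Client-side handling: surface a retryable error instead of crashing."""
    try:
        return node.write(key, value)
    except NoQuorumError:
        return "unavailable, retry later"


# A 5-node cluster splits into a group of 3 and a group of 2.
majority_side = CPNode("node-a", cluster_size=5)
majority_side.reachable_peers = 2   # sees 3 of 5 nodes including itself
minority_side = CPNode("node-d", cluster_size=5)
minority_side.reachable_peers = 1   # sees 2 of 5 nodes including itself

print(submit(majority_side, "seat-12A", "alice"))
print(submit(minority_side, "seat-12A", "bob"))
```

Only the majority side accepts the booking, so the seat can never be sold twice. The cost is that users routed to the minority side see errors until the partition heals.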

AP Systems: Uptime Over Correctness

When the network splits, an AP system keeps answering requests—even if some responses are stale or need later reconciliation.

Real-World Example: Amazon Dynamo & the Shopping Cart

Amazon designed Dynamo, the internal key-value store whose ideas later influenced DynamoDB, for Availability + Partition Tolerance because of a business lesson: an unavailable shopping cart costs more money than a slightly stale one. If a user adds an item to their cart during a network partition, Dynamo accepts the write on whichever replica is reachable. When the partition heals, it uses vector clocks to detect versions that diverged and hands them to the application to merge; simpler workloads can fall back to "last writer wins". The user might briefly see an older version of their cart, but the cart stays writable throughout.

  • Best for: Social feeds, likes, view counters, analytics, caches—anything where being slightly stale is acceptable.
  • Trade-off: Conflict resolution complexity. You need strategies for reconciling diverged data.
  • Other AP systems: Cassandra (used at Instagram and, historically, at Discord), CouchDB, Riak.
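
To make the reconciliation step concrete, here is a minimal sketch of vector-clock comparison with a set-union merge for a shopping cart. The function names and the `(clock, items)` tuple layout are assumptions made for illustration; they are not an API of Dynamo, DynamoDB, or any other store.

```python
def happened_before(a, b):
    """True if vector clock `a` is strictly older than vector clock `b`."""
    nodes = set(a) | set(b)
    no_greater = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    some_less = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return no_greater and some_less


def reconcile(version_1, version_2):
    """Each version is (vector_clock, cart_items). Returns the surviving version."""
    clock_1, items_1 = version_1
    clock_2, items_2 = version_2
    if happened_before(clock_1, clock_2):
        return version_2            # version_2 already contains version_1's history
    if happened_before(clock_2, clock_1):
        return version_1
    # Neither is older: the writes were concurrent (or identical), so merge.
    merged_clock = {
        n: max(clock_1.get(n, 0), clock_2.get(n, 0))
        for n in set(clock_1) | set(clock_2)
    }
    return merged_clock, items_1 | items_2


# Both sides of a partition start from the same cart, then diverge.
side_a = ({"A": 2}, {"book", "pen"})           # replica A added "pen"
side_b = ({"A": 1, "B": 1}, {"book", "lamp"})  # replica B added "lamp"

clock, cart = reconcile(side_a, side_b)
print(clock, sorted(cart))
```

A union merge never loses an added item, which suits a cart, but it also means an item removed on one side can reappear after the merge. That is the kind of conflict-resolution complexity the trade-off bullet refers to.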

Common CAP Misconceptions

  • "Pick any 2 of 3": Misleading. In modern distributed systems, P is not optional; you're really choosing between C and A during partitions.
  • "CAP applies during normal operation": Wrong. CAP specifically describes behavior during partition events. When the network is healthy, you can have both consistency and availability.
  • "Eventual consistency = no consistency": Wrong. Eventually consistent systems do converge to a consistent state once updates stop arriving; they just don't guarantee when. Twitter's follower count might lag by a few seconds, but it eventually catches up.
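
The convergence claim in the last bullet can be illustrated with a toy anti-entropy loop. This sketch assumes every write carries a monotonically increasing version number and that neighboring replicas periodically exchange state; both are simplifying assumptions, and `sync` and `converge` are hypothetical helpers, not part of any real database.

```python
def sync(left, right):
    """Both replicas keep whichever (version, follower_count) pair is newer."""
    newest = max(left["state"], right["state"])  # tuples compare by version first
    left["state"] = right["state"] = newest


def converge(replicas):
    """Run neighbor-to-neighbor sync rounds until all replicas agree."""
    rounds = 0
    while len({replica["state"] for replica in replicas}) > 1:
        for left, right in zip(replicas, replicas[1:]):
            sync(left, right)
        rounds += 1
    return rounds


# Three replicas of a follower counter that have seen different writes.
replicas = [
    {"name": "us-east", "state": (1, 1180)},
    {"name": "eu-west", "state": (2, 1190)},
    {"name": "ap-south", "state": (3, 1200)},
]

rounds = converge(replicas)
print(rounds, [replica["state"] for replica in replicas])
```

Readers hitting `us-east` see an older count for a short while, yet once writes pause every replica ends up with the same newest value.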

Practical Consistency Tools

Real systems tune the consistency/availability balance with techniques like:

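  • Quorums: Read from R replicas, write to W replicas out of N total. If R + W > N, every read set overlaps every write set, so a read always reaches at least one replica holding the latest acknowledged write. That overlap is the basis for strong consistency, although concurrent writes and failure handling still need care. Cassandra lets you configure the consistency level per query.
  • Leader-based replication: All writes go through one leader node, and followers replicate from it, asynchronously by default. PostgreSQL and MySQL default to this model.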
  • Bounded staleness: "Serve data that is at most 5 seconds old." Azure Cosmos DB offers this as a configurable consistency level.
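
The quorum rule can be checked by brute force. The sketch below enumerates every possible read set and write set for small clusters and confirms they always intersect exactly when R + W > N; `quorums_always_overlap` and `quorum_read` are illustrative names, and replicas are modeled as plain `(version, value)` tuples.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every choice of r readers shares a replica with every choice of w writers."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def quorum_read(replicas, read_set):
    """Return the highest-versioned (version, value) pair among the replicas read."""
    return max(replicas[i] for i in read_set)


# Check the rule for every (R, W) combination on clusters of up to 5 nodes.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert quorums_always_overlap(n, r, w) == (r + w > n)

# N=3, W=2: the newest write (version 2) reached replicas 0 and 1 only.
replicas = [(2, "new"), (2, "new"), (1, "old")]
# With R=2, any pair of replicas includes at least one copy of version 2.
results = {quorum_read(replicas, rs) for rs in combinations(range(3), 2)}
print(results)
```

With R=1 the same cluster could serve the stale copy from replica 2, which is the latency-for-consistency trade a per-query setting exposes.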

PACELC: The More Complete Picture

PACELC extends CAP: if there is a Partition, choose A or C; Else (no partition), choose Latency or Consistency. This explains real-world behavior better:

  • DynamoDB: PA/EL (Available during partition, Low latency otherwise)
  • Google Spanner: PC/EC (Consistent always, even at latency cost)
  • Cassandra: PA/EL (Available and fast, eventually consistent)
  • CockroachDB: PC/EC (Strong consistency, higher latency for cross-region writes)
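
One way to keep these labels straight is to treat PACELC as two independent switches. The sketch below encodes the classifications from the list above; `PacelcProfile` and the behavior descriptions are illustrative assumptions, not vendor terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacelcProfile:
    on_partition: str  # "A" (stay available) or "C" (stay consistent)
    otherwise: str     # "L" (favor latency) or "C" (favor consistency)

    def label(self) -> str:
        return f"P{self.on_partition}/E{self.otherwise}"

    def describe(self, partitioned: bool) -> str:
        if partitioned:
            if self.on_partition == "A":
                return "keeps serving, may return stale data"
            return "rejects requests it cannot serve consistently"
        if self.otherwise == "L":
            return "answers quickly from a nearby replica"
        return "coordinates replicas before answering"


# Classifications taken from the list above.
PROFILES = {
    "DynamoDB": PacelcProfile("A", "L"),
    "Google Spanner": PacelcProfile("C", "C"),
    "Cassandra": PacelcProfile("A", "L"),
    "CockroachDB": PacelcProfile("C", "C"),
}

for name, profile in PROFILES.items():
    print(f"{name}: {profile.label()} | partition: {profile.describe(True)}"
          f" | normal: {profile.describe(False)}")
```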

[!TIP] In interviews, describe the partition explicitly (e.g., "Region A cannot reach Region B") and then explain what the user sees in each region under CP vs AP. Bonus points for mentioning PACELC and explaining behavior outside partition events.
