System Design: The Complete Guide

30. Design a Chat System (WhatsApp)

Designing a real-time messaging system with presence, group chats, and offline delivery.

Mar 5, 2026

[!NOTE] A chat system like WhatsApp handles 100 billion messages per day with real-time delivery, offline support, and end-to-end encryption. This is one of the most popular system design interview questions because it covers WebSockets, message queuing, presence tracking, and persistence — a comprehensive systems design challenge.

Step 1: Requirements

| Feature | Requirement |
| --- | --- |
| 1:1 messaging | Real-time, with offline delivery |
| Group chat | Up to 500 members |
| Presence (online/offline) | Near-real-time status |
| Read receipts | Delivered + Read indicators |
| Push notifications | For offline users |
| Scale | 50M DAU, 100B messages/day |
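
The scale row can be turned into rough per-second and per-user numbers. The sketch below uses only the two figures from the requirements; `AVG_MESSAGE_BYTES` is an assumed payload size added for illustration and is not part of the requirements.

```python
# Back-of-the-envelope numbers derived from the requirements.
MESSAGES_PER_DAY = 100_000_000_000   # 100B messages/day (from the requirements)
DAILY_ACTIVE_USERS = 50_000_000      # 50M DAU (from the requirements)
SECONDS_PER_DAY = 24 * 60 * 60
AVG_MESSAGE_BYTES = 100              # ASSUMPTION: average stored size per message

avg_writes_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
messages_per_user_per_day = MESSAGES_PER_DAY / DAILY_ACTIVE_USERS
storage_per_day_tb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12

print(f"average writes/sec: {avg_writes_per_second:,.0f}")
print(f"messages per active user per day: {messages_per_user_per_day:,.0f}")
print(f"raw storage per day: {storage_per_day_tb:,.0f} TB")
```

Under these inputs the average works out to about 1.16 million writes per second and 2,000 messages per active user per day. The second figure is high for a typical user, so in an interview it is worth saying out loud that the two scale numbers are being treated as independent upper bounds.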

Step 2: High-Level Design (v1)

                              ┌──────────────┐
  [User A] ──WebSocket──→     │              │
                              │ Chat Server  │
  [User B] ──WebSocket──→     │  (Stateful)  │
                              │              │
                              └──────┬───────┘
                                     │
                              ┌──────▼───────┐
                              │   Database   │
                              │ (msg storage)│
                              └──────────────┘

This works at small scale, but a single server cannot hold millions of concurrent WebSocket connections, and it is a single point of failure. We need to scale out.

Step 3: Scaled Architecture (v2)

                    ┌──────────────────┐
                    │ Service Discovery│
                    │ (which server has│
                    │  which user?)    │
                    └────────┬─────────┘
                             │
  [User A] ──WS──→ [Chat Server 1]
                         │
                    [Message Queue / Redis Pub/Sub]
                         │
  [User B] ──WS──→ [Chat Server 2]
                         │
                    ┌────▼───────┐   ┌────────────┐
                    │ Message    │   │ Push       │
                    │ Store      │   │ Service    │
                    │ (Cassandra)│   │ (FCM/APNs) │
                    └────────────┘   └────────────┘

Message Flow (1:1)

  1. User A sends a message via WebSocket to Chat Server 1.
  2. Chat Server 1 persists the message to the message store (status: SENT).
  3. Chat Server 1 looks up which server User B is connected to (via service discovery / Redis).
  4. If User B is online: route message via Redis pub/sub to Chat Server 2 → deliver via WebSocket → update status to DELIVERED.
  5. If User B is offline: send a push notification via FCM/APNs. When User B comes online, fetch undelivered messages from the message store.
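
The five steps above can be sketched as a small in-memory simulation. Everything here is a stand-in chosen for illustration: `Router`, `ChatServer`, and the dictionaries are hypothetical names, a direct method call replaces Redis pub/sub, and a list replaces FCM/APNs.

```python
from dataclasses import dataclass, field


@dataclass
class ChatServer:
    name: str
    delivered: list = field(default_factory=list)  # stands in for WebSocket pushes

    def push(self, user_id, message):
        self.delivered.append((user_id, message["id"]))


@dataclass
class Router:
    servers: dict                                   # server name -> ChatServer
    sessions: dict = field(default_factory=dict)    # user_id -> server name (service discovery)
    store: dict = field(default_factory=dict)       # message id -> message (message store)
    push_notifications: list = field(default_factory=list)  # stands in for FCM/APNs

    def send(self, message):
        # Step 2: persist before attempting delivery.
        message["status"] = "SENT"
        self.store[message["id"]] = message
        # Step 3: find the recipient's chat server, if any.
        server_name = self.sessions.get(message["to"])
        if server_name is not None:
            # Step 4: recipient online, deliver and mark DELIVERED.
            self.servers[server_name].push(message["to"], message)
            message["status"] = "DELIVERED"
        else:
            # Step 5: recipient offline, fall back to a push notification.
            self.push_notifications.append(message["to"])
        return message["status"]

    def on_connect(self, user_id, server_name):
        # Register the session, then flush messages that are still SENT.
        self.sessions[user_id] = server_name
        pending = [m for m in self.store.values()
                   if m["to"] == user_id and m["status"] == "SENT"]
        for m in pending:
            self.servers[server_name].push(user_id, m)
            m["status"] = "DELIVERED"
        return [m["id"] for m in pending]


router = Router(servers={"cs1": ChatServer("cs1"), "cs2": ChatServer("cs2")})
router.on_connect("user_a", "cs1")
first = router.send({"id": 1, "from": "user_a", "to": "user_b", "body": "Hello"})
caught_up = router.on_connect("user_b", "cs2")
second = router.send({"id": 2, "from": "user_a", "to": "user_b", "body": "Still there?"})
print(first, caught_up, second)
```

The first message stays `SENT` because User B has no session yet, is flushed when User B connects, and the second message is delivered immediately. A real deployment would wait for the client's ACK before marking a message `DELIVERED`; the sketch skips that round trip.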

Message Flow (Group Chat)

For a group with 500 members:

  • Fan-out on write: When User A sends a message to a group, write a copy to each member's inbox. Expensive for large groups, but read is O(1).
  • Fan-out on read: Store the message once, and when each member reads, they query the group's message feed. Cheaper writes, but reads are more expensive.

A common hybrid is fan-out on write for small groups and fan-out on read for large ones. The size at which to switch is a tuning decision; the Group Chat: Fan-Out Strategy section below gives example thresholds.
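
The write-versus-read trade-off can be made concrete with a minimal in-memory sketch. The class names `FanOutOnWrite` and `FanOutOnRead` and the `writes` counters are hypothetical, added only to count storage operations for a 500-member group.

```python
from collections import defaultdict


class FanOutOnWrite:
    """Copy each message into every member's inbox at send time."""

    def __init__(self):
        self.inboxes = defaultdict(list)  # member_id -> messages
        self.writes = 0

    def send(self, members, message):
        for member in members:
            self.inboxes[member].append(message)
            self.writes += 1

    def read(self, member):
        return list(self.inboxes[member])  # one lookup, no merging


class FanOutOnRead:
    """Store each message once per group; readers merge their groups' feeds."""

    def __init__(self):
        self.feeds = defaultdict(list)       # group_id -> messages
        self.memberships = defaultdict(set)  # member_id -> group_ids
        self.writes = 0

    def join(self, member, group_id):
        self.memberships[member].add(group_id)

    def send(self, group_id, message):
        self.feeds[group_id].append(message)
        self.writes += 1

    def read(self, member):
        merged = []
        for group_id in self.memberships[member]:
            merged.extend(self.feeds[group_id])
        return sorted(merged, key=lambda m: m["id"])


members = [f"user_{i}" for i in range(500)]
message = {"id": 1, "group": "g1", "body": "Hello"}

on_write = FanOutOnWrite()
on_write.send(members, message)

on_read = FanOutOnRead()
for member in members:
    on_read.join(member, "g1")
on_read.send("g1", message)

print(on_write.writes, on_read.writes)  # 500 writes versus 1 write
```

Both models return the same message to any member; the difference is where the work happens. Fan-out on write pays 500 writes up front, while fan-out on read pays a merge across all of a member's groups on every read.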

Step 4: Data Model

Messages table (Cassandra — partitioned by conversation_id):
┌───────────────────┬─────────────────┬──────────┬──────────┬─────────┐
│ conversation_id   │ message_id      │ sender_id│ content  │ status  │
│ (partition key)   │ (clustering key)│          │          │         │
├───────────────────┼─────────────────┼──────────┼──────────┼─────────┤
│ conv_ab_123       │ snowflake_001   │ user_a   │ "Hello"  │DELIVERED│
│ conv_ab_123       │ snowflake_002   │ user_b   │ "Hi!"    │ READ    │
└───────────────────┴─────────────────┴──────────┴──────────┴─────────┘

Sorting: message_id (Snowflake) sorts by time automatically.

Why Cassandra? Chat messages are write-heavy (100B writes/day), partitioned by conversation (each conversation is a natural partition), and queried in time order (Snowflake IDs sort naturally). Cassandra excels at all three.
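
To show why Snowflake-style IDs sort by time, here is a minimal generator sketch. It assumes the commonly described 64-bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence); the class name, the injectable `clock` parameter, and the choice of epoch are illustrative assumptions, not something this design prescribes.

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # custom epoch (the one Twitter used); any fixed past instant works


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < 1024:        # worker ID must fit in 10 bits
            raise ValueError("worker_id must be in [0, 1023]")
        self.worker_id = worker_id
        self.clock = clock                   # injectable so the sketch can be tested
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:
                    # 4096 IDs used in this millisecond: spin until the clock advances.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            # The timestamp occupies the high bits, so numeric order follows time order.
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence


def timestamp_ms(snowflake_id):
    """Recover the creation time (Unix ms) from an ID."""
    return (snowflake_id >> 22) + EPOCH_MS


generator = SnowflakeGenerator(worker_id=1)
ids = [generator.next_id() for _ in range(5)]
print(ids == sorted(ids))  # True: each ID is larger than the one before it
```

IDs from a single generator are strictly increasing. IDs from different workers are ordered by timestamp first, so ordering across servers is only as good as clock synchronization between them.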

Step 5: Presence System

Showing "online" / "last seen" for each user:

  • Heartbeat approach: Each connected client sends a heartbeat every 30 seconds. If no heartbeat for 60 seconds, mark as offline.
  • Store presence in Redis: presence:{user_id} → {server_id, last_heartbeat} with a TTL of 60s.
  • When a friend opens a chat, check Redis for that user's presence status.

Optimization for groups: Don't broadcast individual presence updates to all 500 members. Instead, lazy-load presence when a user opens the group info screen.
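
The heartbeat-plus-TTL idea can be sketched without a Redis server. `PresenceStore` below is a hypothetical in-memory stand-in that mimics a key with a 60-second TTL; the injectable `clock` is an assumption added so that expiry can be demonstrated without waiting.

```python
import time

HEARTBEAT_INTERVAL_S = 30  # client sends a heartbeat this often
PRESENCE_TTL_S = 60        # entry expires if no heartbeat arrives within this window


class PresenceStore:
    """In-memory stand-in for Redis keys of the form presence:{user_id} with a TTL."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # user_id -> (server_id, last_heartbeat, expires_at)

    def heartbeat(self, user_id, server_id):
        now = self.clock()
        # Each heartbeat rewrites the entry and resets its expiry.
        self.entries[user_id] = (server_id, now, now + PRESENCE_TTL_S)

    def lookup(self, user_id):
        entry = self.entries.get(user_id)
        if entry is None:
            return {"online": False, "last_seen": None}
        server_id, last_heartbeat, expires_at = entry
        if self.clock() >= expires_at:
            # Real Redis would have deleted the key by now; "last seen" would
            # then have to be kept in a separate, non-expiring record.
            return {"online": False, "last_seen": last_heartbeat}
        return {"online": True, "server_id": server_id, "last_seen": last_heartbeat}


now = [0.0]                                   # fake clock, in seconds
store = PresenceStore(clock=lambda: now[0])
store.heartbeat("user_b", "cs2")
now[0] = 45.0
print(store.lookup("user_b")["online"])       # one missed heartbeat is tolerated
now[0] = 60.0
print(store.lookup("user_b")["online"])       # TTL elapsed, user shown as offline
```

Because the TTL is twice the heartbeat interval, a single dropped heartbeat does not flip the user to offline. Lookups happen only when someone asks, which matches the lazy-loading approach for groups.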

Step 6: Read Receipts

Message lifecycle:
  SENT → DELIVERED → READ

  User A sends message (status: SENT)
  Server stores and delivers to User B (status: DELIVERED, notify User A)
  User B opens the chat (status: READ, notify User A)

Read receipts are sent back as lightweight WebSocket events. For group chats, aggregate: "Read by 15 of 20 members."

Message Delivery State Machine

Message Lifecycle:

  [Client A sends]
       ↓
  SENT (stored on server, ACK sent to Client A)
       ↓
  DELIVERED (pushed to Client B's device, ACK sent back)
       ↓
  READ (Client B opens conversation, read receipt sent)

Server States per message:
  status: PENDING → SENT → DELIVERED → READ
  Each transition triggers a push to Client A (via WebSocket if online)

Offline handling:
  If Client B is offline → message stays as SENT
  When Client B reconnects → server pushes all SENT messages
  Client B ACKs each → status moves to DELIVERED
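
The lifecycle above can be enforced with a small state machine. This is a sketch under two assumptions that the text does not spell out: duplicate or stale events are ignored (so retried ACKs are harmless), and skipping a state is rejected. `Status`, `apply_event`, and `deliver_on_reconnect` are hypothetical names.

```python
from enum import IntEnum


class Status(IntEnum):
    PENDING = 0
    SENT = 1
    DELIVERED = 2
    READ = 3


def apply_event(current, target):
    """Return (new_status, notify_sender) for a requested transition."""
    if target <= current:
        # Duplicate or out-of-date event (for example a retried ACK): no change.
        return current, False
    if target != current + 1:
        raise ValueError(f"cannot jump from {current.name} to {target.name}")
    # Status advanced by one step, so the sender should be notified.
    return target, True


def deliver_on_reconnect(statuses):
    """Move every SENT message to DELIVERED, as if the client ACKed each one."""
    acked = []
    for message_id, status in statuses.items():
        if status is Status.SENT:
            statuses[message_id], _ = apply_event(status, Status.DELIVERED)
            acked.append(message_id)
    return acked


mailbox = {"m1": Status.SENT, "m2": Status.READ, "m3": Status.SENT}
print(deliver_on_reconnect(mailbox))  # ['m1', 'm3']
```

Treating the status as monotonic means a late `DELIVERED` ACK can never overwrite a `READ` state, which matters once ACKs travel over unreliable mobile connections.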

Cassandra Schema for Messages

CREATE TABLE messages (
    conversation_id UUID,
    message_id      TIMEUUID,   -- time-sortable, plays the role of a Snowflake-style ID
    sender_id       UUID,
    content         TEXT,
    content_type    TEXT,       -- 'text', 'image', 'video'
    created_at      TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Partition key: conversation_id
--   → All messages in a conversation live in the same partition (same replica set)
--   → Efficient range queries: "get messages after ID X"

-- Query: Get latest 50 messages in a conversation
SELECT *
FROM messages
WHERE conversation_id = ?
ORDER BY message_id DESC
LIMIT 50;
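
The access pattern behind this schema (one partition per conversation, rows kept sorted by message ID) can be modeled in memory to show how "latest N" and "older than X" queries work. `MessageStore` is a hypothetical stand-in, not a Cassandra client, and integer IDs replace TIMEUUIDs for readability.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class MessageStore:
    def __init__(self):
        # conversation_id (partition key) -> rows sorted ascending by message_id
        self.partitions = defaultdict(list)

    def insert(self, conversation_id, message_id, content):
        insort(self.partitions[conversation_id], (message_id, content))

    def latest(self, conversation_id, limit=50, before_id=None):
        """Newest-first page; pass before_id to fetch the page of older messages."""
        rows = self.partitions[conversation_id]
        # (before_id,) sorts ahead of any (before_id, content) row, so this index
        # is the first row whose message_id is >= before_id.
        end = len(rows) if before_id is None else bisect_left(rows, (before_id,))
        start = max(0, end - limit)
        return rows[start:end][::-1]


store = MessageStore()
for i in range(1, 6):
    store.insert("conv_ab_123", i, f"message {i}")

page_one = store.latest("conv_ab_123", limit=2)
page_two = store.latest("conv_ab_123", limit=2, before_id=page_one[-1][0])
print(page_one)
print(page_two)
```

Paging by "IDs older than the last one seen" keeps each request confined to a single partition and avoids offset-based scans, which is the property the clustering order is chosen for.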

Group Chat: Fan-Out Strategy

| Group Size | Strategy | How It Works |
| --- | --- | --- |
| Small (2-100 members) | Fan-out on write | Server pushes message to each member's inbox immediately. Fast reads. |
| Large (101-10K members) | Fan-out on read | Message stored once in group's channel. Members pull on open. Saves writes. |
| Broadcast (10K+) | Pub/Sub channel | Members subscribe. Only online users receive in real-time. Others pull on reconnect. |
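
The table amounts to a lookup on group size. A minimal sketch follows; the function name is hypothetical, and the exact boundary handling (100 counts as small, 10,000 counts as large) is an assumption, since the thresholds are guidelines rather than hard rules.

```python
def choose_fanout_strategy(member_count):
    """Map a group size to the delivery strategy suggested by the table."""
    if member_count < 2:
        raise ValueError("a group needs at least 2 members")
    if member_count <= 100:
        return "fan-out on write"
    if member_count <= 10_000:
        return "fan-out on read"
    return "pub/sub channel"


for size in (5, 500, 50_000):
    print(size, "->", choose_fanout_strategy(size))
```

With the 500-member cap from the requirements, only the first two branches are ever reached.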

WhatsApp's approach: WhatsApp caps groups at 1,024 members and is commonly described as using fan-out on write for every group, which keeps the architecture simple. The cap bounds the cost of each fan-out and avoids the complexity of very large groups.

End-to-End Encryption (E2EE)

WhatsApp and Signal use the Signal Protocol for E2EE. Key insight: the server never sees plaintext messages. It only stores encrypted blobs. This means the server cannot search, moderate, or read message content. The tradeoff: server-side features like search are impossible without client-side indexing.

Message flow with E2EE:

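  1. Client A encrypts the message on-device so that only Client B can decrypt it. (This is often simplified to "with Client B's public key"; the Signal Protocol actually derives per-session keys from both parties' keys.)
  2. Encrypted message → Server (stores encrypted blob).
  3. Server pushes the encrypted blob to Client B.
  4. Client B decrypts on-device with private key material that never leaves the device.

Server sees: an opaque ciphertext blob (random-looking bytes, often Base64-encoded for transport).
Server cannot: search, moderate, recommend, or read content.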
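
The "server only stores blobs" property can be illustrated with a toy example. This is deliberately NOT the Signal Protocol: it uses a one-time XOR pad as a stand-in cipher and assumes the two clients already share the key, which is the hard part that real key agreement solves. `RelayServer` and `xor_bytes` are hypothetical names, and nothing here is suitable for production use.

```python
import secrets


def xor_bytes(data, key):
    """Toy cipher: XOR with a same-length, single-use random key (illustration only)."""
    if len(data) != len(key):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


class RelayServer:
    """Stores and forwards ciphertext; it never holds a key."""

    def __init__(self):
        self.blobs = {}

    def store(self, message_id, ciphertext):
        self.blobs[message_id] = ciphertext

    def fetch(self, message_id):
        return self.blobs[message_id]


plaintext = "hello world".encode()
shared_key = secrets.token_bytes(len(plaintext))  # ASSUMPTION: already shared by A and B

server = RelayServer()
server.store("m1", xor_bytes(plaintext, shared_key))    # Client A encrypts, then uploads

received = xor_bytes(server.fetch("m1"), shared_key)    # Client B downloads, then decrypts
print(received.decode())
```

The server's copy is just bytes with no structure it can index or scan, which is why search and moderation have to move to the clients. Note that encoding (for example Base64) is not encryption: an encoded string can be decoded by anyone, while ciphertext is useless without the key.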

Common Mistakes

  • ❌ Using a single relational database for messages: an unsharded SQL database struggles with the write volume and partitioning needs of chat at this scale. Use Cassandra, ScyllaDB, or a similar write-optimized store.
  • ❌ Storing messages only in memory: users expect to see message history when they reinstall the app. Always persist to durable storage.
  • ❌ Broadcasting presence to all contacts: if a user has 1,000 contacts, going online sends 1,000 updates. Use lazy presence loading.
  • ❌ Not handling message ordering: use time-sortable IDs (Snowflake or TIMEUUID) so that messages order consistently across distributed servers.
  • ❌ No offline message queue: if a user is offline, messages must be queued and delivered on reconnect. Don't drop them.

[!TIP] Key Takeaways:
• WebSocket for real-time delivery; push notifications for offline users.
• Redis pub/sub to route messages between chat servers.
• Cassandra for message storage: write-optimized, partitioned by conversation, time-sorted.
• Fan-out on write for small groups, fan-out on read for large groups.
• Message state machine: SENT → DELIVERED → READ. Each transition triggers a push update.
• E2EE means the server only stores encrypted blobs — cannot search or moderate.
• Presence via heartbeats + Redis TTL. Lazy-load for groups.
