Skip to content
QuizMaker logoQuizMaker
Activity
System Design: The Complete Guide
2. Intermediate Architecture
1. Introduction to System Design
2. Vertical vs Horizontal Scaling
3. Load Balancing
4. Caching Strategies
5. CDNs (Content Delivery Networks)
6. SQL vs NoSQL
7. Database Sharding & Partitioning
8. The CAP Theorem
9. Microservices Architecture
10. Message Queues & Event Streaming
12. Design BookMyShow (Ticket Booking)
14. Design Dropbox (Cloud File Storage)
15. How to Approach Any System Design Interview
16. Back-of-the-Envelope Estimation
17. Consistent Hashing
18. Bloom Filters & Probabilistic Data Structures
19. Database Replication
20. Leader Election & Consensus (Raft & Paxos)
21. Distributed Transactions (Saga, 2PC, Outbox)
22. Event Sourcing & CQRS
23. Unique ID Generation at Scale
24. Rate Limiting Algorithms
25. Circuit Breakers & Bulkhead Pattern
26. API Gateway, Proxies & Service Mesh
27. Real-Time Communication
28. Observability (Tracing, Logging, SLOs)
30. Design a Chat System (WhatsApp)
31. Design YouTube (Video Streaming)
32. Design a Web Crawler
CONTENTS

7. Database Sharding & Partitioning

Splitting massive tables across multiple machines.

Feb 22, 20265 views0 likes0 fires
18px

The Data Ceiling

Even the most expensive SQL server in the world eventually runs out of disk space, RAM, or CPU. If you are Instagram storing 2+ billion photos, or Uber logging billions of GPS traces per day, you physically cannot fit all the data on one machine. This is where Sharding (Horizontal Partitioning) comes in.

What is Sharding?

Sharding is the process of splitting a massive database table''s rows across multiple separate database servers (called "shards" or "nodes"). The application connects to a routing layer that figures out which physical server holds the requested data.

For example, if you have a massive Users table:

  • Shard 1: Users with IDs 1 to 1,000,000
  • Shard 2: Users with IDs 1,000,001 to 2,000,000
  • Shard 3: Users with IDs 2,000,001 to 3,000,000

Real-World Example: Instagram

Instagram shards their PostgreSQL database by user ID. Each shard holds a subset of users and all their associated data (photos, likes, comments). This lets them scale write throughput linearly—adding more shards means more write capacity. At scale, they run thousands of PostgreSQL shard instances across AWS.

The Shard Key Challenge

The logic that determines which shard holds a piece of data is called the Shard Key. Choosing the right shard key is arguably the hardest decision in database architecture. Get it wrong, and you create a Hotspot (also called the "Celebrity Problem").

Example of a Terrible Shard Key

Imagine you''re building a social network and you shard by Country:

  • Shard 1 handles the USA (330 million users)
  • Shard 2 handles Iceland (370,000 users)

During peak traffic, Shard 1 (USA) completely melts down under millions of requests per second, while Shard 2 (Iceland) sits idle at 1% CPU. You''ve defeated the entire purpose of sharding.

A Good Shard Key: Hash-Based

Use a hash function on user ID: shard = hash(user_id) % num_shards. This distributes data evenly regardless of geography, name, or signup date.

  • Pinterest shards by hashing user ID, ensuring even distribution across their MySQL fleet.
  • Discord shards their Cassandra clusters by channel ID, so active servers don''t overload a single node.

Consistent Hashing (The Modern Approach)

Simple modular hashing (hash % N) breaks catastrophically when you add or remove shards—every key gets remapped. Consistent hashing solves this by organizing shards on a virtual ring. When you add a shard, only ~1/N of the keys need to move, not all of them.

DynamoDB, Cassandra, and Riak all use consistent hashing internally. In interviews, mentioning consistent hashing when discussing resharding instantly shows depth.

Resharding: The Pain of Growth

What happens when your 4 shards are at capacity and you need to go to 8? Resharding is one of the most complex operations in database engineering:

  1. Dual-write: Write to both old and new shard layout simultaneously.
  2. Backfill: Copy historical data from old shards to new positions.
  3. Cutover: Switch reads to the new layout and stop writing to the old one.
  4. Cleanup: Remove duplicated data from old shards.

Vitess (used by YouTube and Slack) automates MySQL resharding. Without such tooling, resharding can take weeks of careful engineering.

The Scatter-Gather Problem

If a query doesn''t include the shard key, the routing layer must ask every shard for results and merge them. This is called scatter-gather and it''s expensive:

  • A 100-shard cluster means 100 parallel queries, plus merge time.
  • The slowest shard determines your overall latency (p99 suffers).

This is why designing your shard key around your most common queries is critical. If most queries are "get data for user X," then user_id is the right shard key.

[!IMPORTANT] Sharding is a last resort. Before sharding, exhaust simpler options: read replicas, caching, connection pooling, and table partitioning. Once you shard, cross-shard joins become nearly impossible, and every query must be shard-aware. Only shard when a single database truly cannot handle your throughput or storage needs.

Share this article

Share on TwitterShare on LinkedInShare on FacebookShare on WhatsAppShare on Email

Test your knowledge

Take a quick quiz based on this chapter.

hardSystem Design
Quiz: Sharding
5 questions5 min

Continue Learning

8. The CAP Theorem

Intermediate
16 min

9. Microservices Architecture

Intermediate
14 min

10. Message Queues & Event Streaming

Intermediate
16 min
Lesson 2 of 5 in 2. Intermediate Architecture
Previous in 2. Intermediate Architecture
6. SQL vs NoSQL
Next in 2. Intermediate Architecture
8. The CAP Theorem
← Back to System Design: The Complete Guide
Back to System Design: The Complete GuideAll Categories