Skip to content
QuizMaker logoQuizMaker
Activity
System Design: The Complete Guide
5. Advanced Concepts & Patterns
1. Introduction to System Design
2. Vertical vs Horizontal Scaling
3. Load Balancing
4. Caching Strategies
5. CDNs (Content Delivery Networks)
6. SQL vs NoSQL
7. Database Sharding & Partitioning
8. The CAP Theorem
9. Microservices Architecture
10. Message Queues & Event Streaming
12. Design BookMyShow (Ticket Booking)
14. Design Dropbox (Cloud File Storage)
15. How to Approach Any System Design Interview
16. Back-of-the-Envelope Estimation
17. Consistent Hashing
18. Bloom Filters & Probabilistic Data Structures
19. Database Replication
20. Leader Election & Consensus (Raft & Paxos)
21. Distributed Transactions (Saga, 2PC, Outbox)
22. Event Sourcing & CQRS
23. Unique ID Generation at Scale
24. Rate Limiting Algorithms
25. Circuit Breakers & Bulkhead Pattern
26. API Gateway, Proxies & Service Mesh
27. Real-Time Communication
28. Observability (Tracing, Logging, SLOs)
30. Design a Chat System (WhatsApp)
31. Design YouTube (Video Streaming)
32. Design a Web Crawler
CONTENTS

23. Unique ID Generation at Scale

How to generate billions of unique, sortable, distributed-friendly identifiers without coordination.

Mar 5, 202616 views0 likes0 fires
18px

[!NOTE] In a monolith with a single database, auto-incrementing IDs (id SERIAL) work fine. In a distributed system with multiple database shards, multiple services, and messages flowing through queues, you need IDs that are globally unique without coordination, roughly sortable by time, and fast to generate.

Why Auto-Increment Fails in Distributed Systems

  • Single point of failure: The database sequence is on one machine. If it goes down, no service can generate IDs.
  • Cross-shard conflicts: If Shard A and Shard B both have id SERIAL, they will both generate ID 1, 2, 3... — collisions everywhere.
  • Performance bottleneck: Every service must round-trip to the database just to get an ID, even before doing useful work.

UUID v4 (Random)

UUID v4 generates 128-bit random identifiers: 550e8400-e29b-41d4-a716-446655440000

  • Pro: No coordination needed — any node can generate one instantly. Collision probability is astronomically low (2122 possibilities).
  • Con: Not sortable by time. Random values cause poor B-tree index performance in databases — new inserts scatter across the index instead of appending to the end, causing excessive page splits.

UUID v7 (Timestamp-Ordered)

UUID v7 (RFC 9562, 2024) embeds a Unix timestamp in the first 48 bits, followed by random bits:

  Timestamp (48 bits)  |  Random (80 bits)
  ─────────────────────┼──────────────────
  018f5e3c-7a00       |  7xxx-xxxx-......
  • Pro: Sortable by time (newer UUIDs sort after older ones). B-tree friendly — new inserts always append near the end. No coordination needed.
  • Con: Slightly less random than v4 (but still astronomically unique). Not universally supported yet (newer standard).

Twitter Snowflake

Twitter created the Snowflake ID generator in 2010 to produce unique, time-sorted, 64-bit integers at massive scale. A Snowflake ID is a 64-bit integer packed as:

  0 | Timestamp (41 bits)     | Datacenter (5 bits) | Machine (5 bits) | Sequence (12 bits)
  ─┼─────────────────────────┼────────────────────┼──────────────────┼────────────────────
    | Milliseconds since epoch| 32 datacenters     | 32 machines/DC   | 4096 IDs/ms/machine
  • Capacity: 4,096 IDs per millisecond per machine × 1,024 machines = ~4 million IDs per millisecond.
  • Sortable: Timestamp is in the most significant bits, so IDs are naturally sorted by creation time.
  • No coordination: Each machine generates IDs independently using its unique datacenter + machine ID.
  • Compact: 64-bit integer fits in a single database BIGINT column — half the size of a UUID.

Clock Skew: The Achilles'' Heel

Snowflake relies on system clocks. If a machine''s clock drifts backward (e.g., after NTP correction), it could generate duplicate IDs. Solutions:

  • Wait until the clock catches up (Snowflake''s approach — refuse to generate IDs during clock regression).
  • Use a logical clock component that tracks the highest seen timestamp.

Comparison Table

MethodSizeSortableCoordinationDB Index PerformanceCollision Risk
DB Auto-Increment32/64 bitYesCentralized DBExcellentNone (single source)
UUID v4128 bitNoNonePoor (random scattering)Near zero
UUID v7128 bitYesNoneGood (time-ordered)Near zero
Snowflake64 bitYesMachine ID assignmentExcellentNear zero

Real-World Usage

  • Twitter: Invented Snowflake. Every tweet ID is a Snowflake — you can extract the timestamp from any tweet ID.
  • Discord: Uses a Snowflake variant for message IDs. Since IDs are time-sorted, fetching "messages after ID X" is a simple range query.
  • Instagram: Uses a modified Snowflake that combines a timestamp with a shard ID and a per-shard auto-increment sequence, embedded in a 64-bit integer.

Snowflake ID: Implementation Deep Dive

Snowflake ID Structure (64 bits):

| 1 bit  |  41 bits         | 10 bits    | 12 bits      |
| unused |  timestamp (ms)  | machine ID | sequence num |

timestamp: milliseconds since custom epoch (e.g., 2020-01-01)
  → 2^41 ms = ~69 years before overflow
machine_id: unique per server (1024 possible machines)
sequence: per-millisecond counter (4096 IDs per ms per machine)

Max throughput per machine: 4,096,000 IDs/second
Max throughput (1024 machines): ~4 billion IDs/second
class SnowflakeGenerator:
    EPOCH = 1577836800000  # 2020-01-01 00:00:00 UTC

    def init(self, machine_id):
        self.machine_id = machine_id  # 0-1023
        self.sequence = 0
        self.last_timestamp = -1

    def generate(self):
        timestamp = current_time_ms() - self.EPOCH

        if timestamp == self.last_timestamp:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit mask
            if self.sequence == 0:
                timestamp = wait_for_next_ms(self.last_timestamp)
        else:
            self.sequence = 0

        if timestamp < self.last_timestamp:
            raise ClockMovedBackwardsError()  # Handle NTP skew!

        self.last_timestamp = timestamp

        return ((timestamp << 22) |
                (self.machine_id << 12) |
                self.sequence)

Database Index Impact: UUID v4 vs v7 Benchmark

MetricUUID v4 (random)UUID v7 (time-sorted)Snowflake (64-bit)Auto-increment
Insert throughput~5,000/sec~12,000/sec~15,000/sec~15,000/sec
Index size (1M rows)~80 MB~80 MB~40 MB~30 MB
Index fragmentationVery high (random)Low (sequential)Low (sequential)None
B-tree page splitsFrequentRare (append-only)RareNever
Distributed-friendlyYesYesYes (needs machine_id)No (SPOF)

Why UUID v4 is slow for inserts: B-tree indexes are optimized for sequential inserts (new rows go to the rightmost leaf). Random UUIDs cause inserts across all leaf pages, triggering constant page splits and cache misses. UUID v7 and Snowflake IDs are time-sorted, so inserts are always sequential — same performance as auto-increment.

Choosing the Right ID Strategy

Use CaseBest ChoiceWhy
Simple web app, single DBAuto-incrementSimplest, best performance
Microservices, needs coordination-freeUUID v7Time-sorted, no central authority
High-throughput (>100K IDs/sec)SnowflakeCompact (64-bit), sortable, fast
Public-facing URLsNanoID or hashidsShort, URL-safe, non-guessable
Globally unique, don''t care about sortingUUID v4Simplest distributed option
Short URL slugsBase62 encoded counterCompact, human-readable

Common Mistakes

  • ❌ Using UUID v4 as a primary key in a relational database — random inserts destroy B-tree performance. Use UUID v7 or Snowflake instead.
  • ❌ Exposing sequential IDs in URLs — users can enumerate resources (e.g., /user/1, /user/2, /user/3). Use non-guessable IDs for public-facing resources.
  • ❌ Assuming clocks are always monotonic — NTP adjustments can move clocks backward. Snowflake implementations must handle this with a wait or error.
  • ❌ Using 128-bit UUIDs when 64-bit Snowflake would suffice — doubles your index size for no benefit. Use Snowflake if you control the ID generation infrastructure.
  • ❌ Not planning for machine_id assignment — in Snowflake, each server needs a unique machine_id. Use ZooKeeper, Consul, or Kubernetes pod index for assignment.

[!TIP] Key Takeaways:
• Auto-increment fails in distributed systems (SPOF, cross-shard collisions).
• UUID v4: simple but not sortable and bad for DB indexes. Avoid as primary key.
• UUID v7: sortable, DB-friendly, no coordination. Best default choice for new systems.
• Snowflake: 64-bit, sortable, compact (4M IDs/sec/machine). Best for high-throughput systems.
• Clock skew is real. Never assume monotonic system time in distributed environments.
• Use NanoID or hashids for short, URL-safe, non-guessable public identifiers.

Share this article

Share on TwitterShare on LinkedInShare on FacebookShare on WhatsAppShare on Email

Test your knowledge

Take a quick quiz based on this chapter.

mediumSystem Design
Quiz: Unique ID Generation
5 questions5 min

Continue Learning

24. Rate Limiting Algorithms

Intermediate
12 min

25. Circuit Breakers & Bulkhead Pattern

Intermediate
14 min

26. API Gateway, Proxies & Service Mesh

Intermediate
14 min
Lesson 7 of 12 in 5. Advanced Concepts & Patterns
Previous in 5. Advanced Concepts & Patterns
22. Event Sourcing & CQRS
Next in 5. Advanced Concepts & Patterns
24. Rate Limiting Algorithms
← Back to System Design: The Complete Guide
Back to System Design: The Complete GuideAll Categories