23. Unique ID Generation at Scale

[!NOTE] In a monolith with a single database, auto-incrementing IDs (id SERIAL) work fine. In a distributed system with multiple database shards, multiple services, and messages flowing through queues, you need IDs that are globally unique without coordination, roughly sortable by time, and fast to generate.

Why Auto-Increment Fails in Distributed Systems

Single point of failure: The database sequence is on one machine. If it goes down, no service can generate IDs.
Cross-shard conflicts: If Shard A and Shard B both have id SERIAL, they will both generate ID 1, 2, 3... — collisions everywhere.
Performance bottleneck: Every service must round-trip to the database just to get an ID, even before doing useful work.

UUID v4 (Random)

UUID v4 generates 128-bit random identifiers: 550e8400-e29b-41d4-a716-446655440000

Pro: No coordination needed — any node can generate one instantly. Collision probability is astronomically low (2¹²² possibilities).
Con: Not sortable by time. Random values cause poor B-tree index performance in databases — new inserts scatter across the index instead of appending to the end, causing excessive page splits.

UUID v7 (Timestamp-Ordered)

UUID v7 (RFC 9562, 2024) embeds a Unix timestamp in the first 48 bits, followed by random bits:

  Timestamp (48 bits)  |  Random (80 bits)
  ─────────────────────┼──────────────────
  018f5e3c-7a00       |  7xxx-xxxx-......

Pro: Sortable by time (newer UUIDs sort after older ones). B-tree friendly — new inserts always append near the end. No coordination needed.
Con: Slightly less random than v4 (but still astronomically unique). Not universally supported yet (newer standard).

Twitter Snowflake

Twitter created the Snowflake ID generator in 2010 to produce unique, time-sorted, 64-bit integers at massive scale. A Snowflake ID is a 64-bit integer packed as:

  0 | Timestamp (41 bits)     | Datacenter (5 bits) | Machine (5 bits) | Sequence (12 bits)
  ─┼─────────────────────────┼────────────────────┼──────────────────┼────────────────────
    | Milliseconds since epoch| 32 datacenters     | 32 machines/DC   | 4096 IDs/ms/machine

Capacity: 4,096 IDs per millisecond per machine × 1,024 machines = ~4 million IDs per millisecond.
Sortable: Timestamp is in the most significant bits, so IDs are naturally sorted by creation time.
No coordination: Each machine generates IDs independently using its unique datacenter + machine ID.
Compact: 64-bit integer fits in a single database BIGINT column — half the size of a UUID.

Clock Skew: The Achilles'' Heel

Snowflake relies on system clocks. If a machine''s clock drifts backward (e.g., after NTP correction), it could generate duplicate IDs. Solutions:

Wait until the clock catches up (Snowflake''s approach — refuse to generate IDs during clock regression).
Use a logical clock component that tracks the highest seen timestamp.

Comparison Table

Method	Size	Sortable	Coordination	DB Index Performance	Collision Risk
DB Auto-Increment	32/64 bit	Yes	Centralized DB	Excellent	None (single source)
UUID v4	128 bit	No	None	Poor (random scattering)	Near zero
UUID v7	128 bit	Yes	None	Good (time-ordered)	Near zero
Snowflake	64 bit	Yes	Machine ID assignment	Excellent	Near zero

Real-World Usage

Twitter: Invented Snowflake. Every tweet ID is a Snowflake — you can extract the timestamp from any tweet ID.
Discord: Uses a Snowflake variant for message IDs. Since IDs are time-sorted, fetching "messages after ID X" is a simple range query.
Instagram: Uses a modified Snowflake that combines a timestamp with a shard ID and a per-shard auto-increment sequence, embedded in a 64-bit integer.

Snowflake ID: Implementation Deep Dive

Snowflake ID Structure (64 bits):

| 1 bit  |  41 bits         | 10 bits    | 12 bits      |
| unused |  timestamp (ms)  | machine ID | sequence num |

timestamp: milliseconds since custom epoch (e.g., 2020-01-01)
  → 2^41 ms = ~69 years before overflow
machine_id: unique per server (1024 possible machines)
sequence: per-millisecond counter (4096 IDs per ms per machine)

Max throughput per machine: 4,096,000 IDs/second
Max throughput (1024 machines): ~4 billion IDs/second

class SnowflakeGenerator:
    EPOCH = 1577836800000  # 2020-01-01 00:00:00 UTC

    def __init__(self, machine_id):
        self.machine_id = machine_id  # 0-1023
        self.sequence = 0
        self.last_timestamp = -1

    def generate(self):
        timestamp = current_time_ms() - self.EPOCH

        if timestamp == self.last_timestamp:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit mask
            if self.sequence == 0:
                timestamp = wait_for_next_ms(self.last_timestamp)
        else:
            self.sequence = 0

        if timestamp < self.last_timestamp:
            raise ClockMovedBackwardsError()  # Handle NTP skew!

        self.last_timestamp = timestamp

        return ((timestamp << 22) |
                (self.machine_id << 12) |
                self.sequence)

Database Index Impact: UUID v4 vs v7 Benchmark

Metric	UUID v4 (random)	UUID v7 (time-sorted)	Snowflake (64-bit)	Auto-increment
Insert throughput	~5,000/sec	~12,000/sec	~15,000/sec	~15,000/sec
Index size (1M rows)	~80 MB	~80 MB	~40 MB	~30 MB
Index fragmentation	Very high (random)	Low (sequential)	Low (sequential)	None
B-tree page splits	Frequent	Rare (append-only)	Rare	Never
Distributed-friendly	Yes	Yes	Yes (needs machine_id)	No (SPOF)

Why UUID v4 is slow for inserts: B-tree indexes are optimized for sequential inserts (new rows go to the rightmost leaf). Random UUIDs cause inserts across all leaf pages, triggering constant page splits and cache misses. UUID v7 and Snowflake IDs are time-sorted, so inserts are always sequential — same performance as auto-increment.

Choosing the Right ID Strategy

Use Case	Best Choice	Why
Simple web app, single DB	Auto-increment	Simplest, best performance
Microservices, needs coordination-free	UUID v7	Time-sorted, no central authority
High-throughput (>100K IDs/sec)	Snowflake	Compact (64-bit), sortable, fast
Public-facing URLs	NanoID or hashids	Short, URL-safe, non-guessable
Globally unique, don''t care about sorting	UUID v4	Simplest distributed option
Short URL slugs	Base62 encoded counter	Compact, human-readable

Common Mistakes

❌ Using UUID v4 as a primary key in a relational database — random inserts destroy B-tree performance. Use UUID v7 or Snowflake instead.
❌ Exposing sequential IDs in URLs — users can enumerate resources (e.g., /user/1, /user/2, /user/3). Use non-guessable IDs for public-facing resources.
❌ Assuming clocks are always monotonic — NTP adjustments can move clocks backward. Snowflake implementations must handle this with a wait or error.
❌ Using 128-bit UUIDs when 64-bit Snowflake would suffice — doubles your index size for no benefit. Use Snowflake if you control the ID generation infrastructure.
❌ Not planning for machine_id assignment — in Snowflake, each server needs a unique machine_id. Use ZooKeeper, Consul, or Kubernetes pod index for assignment.

[!TIP] Key Takeaways:
• Auto-increment fails in distributed systems (SPOF, cross-shard collisions).
• UUID v4: simple but not sortable and bad for DB indexes. Avoid as primary key.
• UUID v7: sortable, DB-friendly, no coordination. Best default choice for new systems.
• Snowflake: 64-bit, sortable, compact (4M IDs/sec/machine). Best for high-throughput systems.
• Clock skew is real. Never assume monotonic system time in distributed environments.
• Use NanoID or hashids for short, URL-safe, non-guessable public identifiers.