21. Distributed Transactions (Saga, 2PC, Outbox)

[!CAUTION] In a monolith, you wrap related operations in a single database transaction: either everything commits or everything rolls back. In a microservices world, each service owns its own database. There is no single transaction that can span them. Getting distributed consistency wrong leads to data corruption, lost orders, double charges, and angry customers.

Two-Phase Commit (2PC)

The classical approach to distributed transactions. A coordinator orchestrates multiple participants:

Phase 1 (Prepare): Coordinator asks all participants: "Can you commit?" Each participant acquires locks, writes to a WAL (write-ahead log), and responds YES or NO.
Phase 2 (Commit/Abort): If ALL participants said YES → coordinator sends COMMIT to all. If ANY said NO → coordinator sends ABORT to all.

Coordinator: "Prepare?"  →  [Service A]  →  YES
                         →  [Service B]  →  YES
                         →  [Service C]  →  YES

Coordinator: "Commit!"   →  [All services commit]

Why 2PC Is Problematic at Scale

Problem	Impact
Blocking	If the coordinator crashes after Phase 1 but before Phase 2, all participants are stuck holding locks, waiting forever for a decision.
High latency	Two network round trips minimum. With cross-datacenter latency, this adds 100–200ms to every transaction.
Tight coupling	All services must be available simultaneously. One unavailable service blocks the entire transaction.

Real-world: Google Spanner uses a form of 2PC with their TrueTime API (GPS-synchronized clocks), but this requires custom hardware that most companies don''t have. For everyone else, 2PC at scale is generally avoided.

The Saga Pattern

A Saga is a sequence of local transactions. Each service performs its own transaction and publishes an event. If any step fails, compensating transactions undo the previous steps.

Example: Order Processing

Step 1: Order Service   → Create Order (status: PENDING)
Step 2: Payment Service → Charge Credit Card
Step 3: Inventory Service → Reserve Items
Step 4: Shipping Service → Schedule Delivery

If Step 3 fails (out of stock):
  Compensate Step 2: Refund Credit Card
  Compensate Step 1: Cancel Order

Choreography vs Orchestration

Style	How	Pros	Cons
Choreography	Each service listens for events and decides what to do next	Decoupled, no central point of failure	Hard to understand the overall flow; debugging is painful
Orchestration	A central orchestrator tells each service what to do next	Clear flow, easy to monitor and debug	Orchestrator = single point of failure, coupling risk

Real-world: Uber uses an orchestrated saga for ride payments. The payment orchestrator coordinates: authorize card → complete ride → capture payment → pay driver. If any step fails, compensating transactions reverse the previous steps.

The Outbox Pattern

A common problem: Your service needs to update its database AND publish an event to a message queue. If the database update succeeds but the event publish fails (or vice versa), you have inconsistency.

The Outbox Pattern solves this:

Write the event to an "outbox" table in the same database transaction as the business data.
A separate poller (or CDC stream) reads from the outbox table and publishes events to the message queue.
Since the event and the data are in the same transaction, they are atomically consistent.

BEGIN TRANSACTION;
  UPDATE orders SET status = ''confirmed'' WHERE id = 123;
  INSERT INTO outbox (event_type, payload) VALUES (''order_confirmed'', ''{"id":123}'');
COMMIT;

-- Poller/CDC reads outbox and publishes to Kafka

Real-world: Shopify uses the outbox pattern with Kafka to ensure reliable event publishing across their microservices.

Idempotency: The Safety Net

In distributed systems, messages can be delivered more than once (network retries, consumer restarts). Every write operation must be idempotent — processing the same message twice should have the same effect as processing it once.

Common approach: include an idempotency key (e.g., a UUID) in every request. Before processing, check if that key was already handled:

POST /payments
Header: Idempotency-Key: 550e8400-e29b-41d4-a716
Body: {"amount": 100, "currency": "USD"}

-- Server checks: have I seen key 550e8400...? 
-- If yes → return cached response  
-- If no → process payment, store key + response

Real-world: Stripe requires an Idempotency-Key header on all payment creation requests. This prevents double charges when a client retries due to network timeout.

CRDTs: Conflict-Free Replicated Data Types

For certain data structures, you can avoid all coordination entirely. A CRDT is a data type designed so that concurrent updates from different nodes can always be merged without conflict:

G-Counter (Grow-only Counter): Each node maintains its own counter; the total is the sum. Used by Cassandra for distributed counters.
OR-Set (Observed-Remove Set): Add and remove elements concurrently without conflicts. Used by Redis Enterprise for active-active geo-replication.
LWW-Register (Last Writer Wins Register): Simple but loses one concurrent write. Used by DynamoDB.

CRDTs are ideal for collaborative editing (Figma uses CRDTs for real-time multiplayer design), distributed counters, and shopping carts.

The Monolith-First Strategy

[!IMPORTANT] Before going distributed, ask: do you actually need microservices? Distributed transactions add enormous complexity. Many successful companies run monoliths at impressive scale.

Amazon Prime Video Case Study (2023): Amazon''s Prime Video team publicly shared that they moved from a microservices architecture back to a monolith for their video monitoring service. The microservices version was more expensive, harder to operate, and the distributed coordination overhead was unnecessary for their use case. The monolith was 90% cheaper and simpler to maintain.

The lesson: start with a monolith. Extract services only when you have a clear reason (independent scaling, team autonomy, different deployment cadences).

Saga Failure Handling: A Complete Example

Consider an e-commerce order saga with 4 steps:

Order Saga (Orchestration):

Step 1: Create Order      → order_service.createOrder()
Step 2: Reserve Inventory  → inventory_service.reserve()
Step 3: Charge Payment     → payment_service.charge()
Step 4: Ship Order         → shipping_service.schedule()

HAPPY PATH:
  Step 1 ✓ → Step 2 ✓ → Step 3 ✓ → Step 4 ✓ → DONE!

FAILURE AT STEP 3 (payment declined):
  Step 1 ✓ → Step 2 ✓ → Step 3 ✗ → COMPENSATE!
  Compensate Step 2: inventory_service.unreserve()
  Compensate Step 1: order_service.cancel()
  → Order marked as FAILED, user notified

The Outbox Pattern in Detail

The outbox pattern solves the dual-write problem — how to atomically update a database AND publish an event:

-- Instead of this (DANGEROUS - not atomic):
UPDATE orders SET status = ''confirmed'' WHERE id = 123;
kafka.publish(''order.confirmed'', {order_id: 123});  -- what if this fails?

-- Use this (SAFE - atomic):
BEGIN TRANSACTION;
  UPDATE orders SET status = ''confirmed'' WHERE id = 123;
  INSERT INTO outbox (event_type, payload, created_at)
  VALUES (''order.confirmed'', ''{"order_id": 123}'', NOW());
COMMIT;

-- Separate process (CDC or poller) reads outbox and publishes:
-- Debezium captures outbox inserts → publishes to Kafka
-- After successful publish, mark outbox row as processed

Used by: Shopify (order processing), Uber (trip events), Netflix (content updates).

Idempotency in Practice

// Client sends request with idempotency key:
POST /api/v1/payments
Headers: { "Idempotency-Key": "pay_abc123xyz" }
Body: { "amount": 50.00, "currency": "USD" }

// Server implementation:
async function processPayment(req) {
  const key = req.headers[''idempotency-key''];
  
  // Check if we''ve seen this key before
  const existing = await redis.get(idempotency:${key});
  if (existing) return JSON.parse(existing);  // Return cached response
  
  // Process the payment
  const result = await stripe.charges.create({...});
  
  // Cache the result (TTL: 24 hours)
  await redis.setex(idempotency:${key}, 86400, JSON.stringify(result));
  return result;
}

When to Use Which Pattern

Scenario	Best Pattern	Why
Single database, multiple tables	Database transaction (ACID)	No distribution needed
2-3 services, strong consistency needed	2PC (if latency is acceptable)	Simplest correct solution
Multiple services, eventual consistency OK	Saga (orchestrated)	Clear flow, good error handling
Event-driven architecture	Saga (choreographed) + Outbox	Loose coupling, scalable
High-throughput, multi-region	CRDTs	No coordination overhead

Common Mistakes

❌ Using 2PC across microservices — the blocking and coupling make it impractical at scale. Use sagas instead.
❌ Forgetting compensating transactions — sagas without compensation logic leave partial state on failure. Every forward step needs a reverse step.
❌ Not making operations idempotent — in a retry-heavy distributed system, duplicate processing is inevitable. Always use idempotency keys.
❌ Going microservices too early — distributed transactions are hard. If you can avoid them with a monolith, do it.
❌ Dual writes without outbox — writing to a DB and publishing to Kafka separately is a recipe for inconsistency. Always use the outbox pattern.

[!TIP] Key Takeaways:
• 2PC: strong consistency but blocking, slow, and tightly coupled — avoid at scale.
• Saga: sequence of local transactions + compensations. Choreography for loose coupling, orchestration for clarity.
• Outbox pattern: atomic DB write + event publish by writing events to an outbox table. Use Debezium for CDC.
• Idempotency keys: protect against duplicate processing (Stripe, AWS). Cache results in Redis with TTL.
• CRDTs: conflict-free data structures that merge without coordination (counters, sets, registers).
• Consider monolith-first. Microservices are a last resort, not a starting point.