> [!CAUTION]
> In a monolith, you wrap related operations in a single database transaction: either everything commits or everything rolls back. In a microservices world, each service owns its own database, and no single transaction can span them. Getting distributed consistency wrong leads to data corruption, lost orders, double charges, and angry customers.
Two-Phase Commit (2PC)
The classical approach to distributed transactions. A coordinator orchestrates multiple participants:
- Phase 1 (Prepare): Coordinator asks all participants: "Can you commit?" Each participant acquires locks, writes to a WAL (write-ahead log), and responds YES or NO.
- Phase 2 (Commit/Abort): If ALL participants said YES → coordinator sends COMMIT to all. If ANY said NO → coordinator sends ABORT to all.
Coordinator: "Prepare?" → [Service A] → YES
                        → [Service B] → YES
                        → [Service C] → YES
Coordinator: "Commit!"  → [All services commit]
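The two phases can be sketched as coordinator logic. This is a minimal illustration, not a real protocol implementation: the participant objects and their `prepare`/`commit`/`abort` methods are hypothetical, and real 2PC also needs the write-ahead logging and crash recovery described above.

```javascript
// Minimal sketch of a 2PC coordinator. Participants are hypothetical
// objects exposing async prepare()/commit()/abort() methods.
async function twoPhaseCommit(participants) {
  // Phase 1 (Prepare): ask everyone to vote; a thrown error counts as NO.
  const votes = await Promise.all(
    participants.map((p) => p.prepare().catch(() => false))
  );

  if (votes.every((v) => v === true)) {
    // Phase 2 (Commit): all voted YES, so tell everyone to commit.
    await Promise.all(participants.map((p) => p.commit()));
    return "COMMITTED";
  }

  // Any NO (or failure) aborts the whole transaction for everyone.
  await Promise.all(participants.map((p) => p.abort()));
  return "ABORTED";
}
```

Note where the blocking problem lives: between the two phases, every participant that voted YES is holding locks, and only this coordinator can release them.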
Why 2PC Is Problematic at Scale
| Problem | Impact |
|---|---|
| Blocking | If the coordinator crashes after Phase 1 but before Phase 2, all participants are stuck holding locks, waiting forever for a decision. |
| High latency | Two network round trips minimum. With cross-datacenter latency, this adds 100–200ms to every transaction. |
| Tight coupling | All services must be available simultaneously. One unavailable service blocks the entire transaction. |
Real-world: Google Spanner uses a form of 2PC with its TrueTime API (clocks synchronized via GPS and atomic clocks), but this requires custom hardware that most companies don't have. For everyone else, 2PC at scale is generally avoided.
The Saga Pattern
A Saga is a sequence of local transactions. Each service performs its own transaction and publishes an event. If any step fails, compensating transactions undo the previous steps.
Example: Order Processing
Step 1: Order Service → Create Order (status: PENDING)
Step 2: Payment Service → Charge Credit Card
Step 3: Inventory Service → Reserve Items
Step 4: Shipping Service → Schedule Delivery
If Step 3 fails (out of stock):
Compensate Step 2: Refund Credit Card
Compensate Step 1: Cancel Order
Choreography vs Orchestration
| Style | How | Pros | Cons |
|---|---|---|---|
| Choreography | Each service listens for events and decides what to do next | Decoupled, no central point of failure | Hard to understand the overall flow; debugging is painful |
| Orchestration | A central orchestrator tells each service what to do next | Clear flow, easy to monitor and debug | Orchestrator = single point of failure, coupling risk |
Real-world: Uber uses an orchestrated saga for ride payments. The payment orchestrator coordinates: authorize card → complete ride → capture payment → pay driver. If any step fails, compensating transactions reverse the previous steps.
The Outbox Pattern
A common problem: Your service needs to update its database AND publish an event to a message queue. If the database update succeeds but the event publish fails (or vice versa), you have inconsistency.
The Outbox Pattern solves this:
- Write the event to an "outbox" table in the same database transaction as the business data.
- A separate poller (or CDC stream) reads from the outbox table and publishes events to the message queue.
- Since the event and the business data are written in the same transaction, they commit or roll back together.
BEGIN TRANSACTION;
UPDATE orders SET status = 'confirmed' WHERE id = 123;
INSERT INTO outbox (event_type, payload) VALUES ('order_confirmed', '{"id":123}');
COMMIT;
-- Poller/CDC reads outbox and publishes to Kafka
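The poller step can be sketched as follows. This is an assumption-laden illustration: the `db` and `queue` clients are hypothetical stand-ins for a SQL driver and a Kafka producer, and it assumes a boolean `published` column on the outbox table.

```javascript
// Sketch of an outbox poller. `db` and `queue` are hypothetical clients
// standing in for a SQL driver and a message-queue producer.
async function drainOutbox(db, queue) {
  // Read unpublished events in insertion order.
  const rows = await db.query(
    "SELECT id, event_type, payload FROM outbox WHERE published = false ORDER BY id"
  );

  for (const row of rows) {
    // Publish first, then mark as published. If the process crashes in
    // between, the event is re-published on the next poll, so this gives
    // at-least-once delivery: consumers must be idempotent.
    await queue.publish(row.event_type, row.payload);
    await db.query("UPDATE outbox SET published = true WHERE id = ?", [row.id]);
  }
  return rows.length;
}
```

Running this on a timer (or replacing it with CDC via Debezium) decouples event publishing from the request path entirely.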
Real-world: Shopify uses the outbox pattern with Kafka to ensure reliable event publishing across their microservices.
Idempotency: The Safety Net
In distributed systems, messages can be delivered more than once (network retries, consumer restarts). Every write operation must be idempotent — processing the same message twice should have the same effect as processing it once.
Common approach: include an idempotency key (e.g., a UUID) in every request. Before processing, check if that key was already handled:
POST /payments
Header: Idempotency-Key: 550e8400-e29b-41d4-a716
Body: {"amount": 100, "currency": "USD"}
-- Server checks: have I seen key 550e8400...?
-- If yes → return cached response
-- If no → process payment, store key + response
Real-world: Stripe requires an Idempotency-Key header on all payment creation requests. This prevents double charges when a client retries due to network timeout.
CRDTs: Conflict-Free Replicated Data Types
For certain data structures, you can avoid coordination entirely. A CRDT is a data type designed so that concurrent updates from different nodes can always be merged without conflict:
- G-Counter (Grow-only Counter): Each node maintains its own counter; the total is the sum. Used by Cassandra for distributed counters.
- OR-Set (Observed-Remove Set): Add and remove elements concurrently without conflicts. Used by Redis Enterprise for active-active geo-replication.
- LWW-Register (Last-Writer-Wins Register): Simple, but when two writes are concurrent, one of them is silently discarded. Used by DynamoDB for conflict resolution.
CRDTs are ideal for collaborative editing (Figma uses CRDTs for real-time multiplayer design), distributed counters, and shopping carts.
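A G-Counter is small enough to sketch in full. This is a simplified in-memory illustration of the idea, not a production CRDT library: each node increments only its own slot, and merging takes the per-node maximum, so replicas converge regardless of the order in which they exchange state.

```javascript
// Minimal G-Counter (grow-only counter) sketch.
class GCounter {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.counts = {}; // nodeId -> that node's local count
  }

  // Each node only ever increments its own slot.
  increment(by = 1) {
    this.counts[this.nodeId] = (this.counts[this.nodeId] || 0) + by;
  }

  // The logical value is the sum over all nodes.
  value() {
    return Object.values(this.counts).reduce((a, b) => a + b, 0);
  }

  // Element-wise max is commutative, associative, and idempotent,
  // which is exactly what makes the merge conflict-free.
  merge(other) {
    for (const [node, n] of Object.entries(other.counts)) {
      this.counts[node] = Math.max(this.counts[node] || 0, n);
    }
  }
}
```

Two replicas can increment independently, merge in either order (or repeatedly), and still agree on the total, with no locks and no coordinator.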
The Monolith-First Strategy
> [!IMPORTANT]
> Before going distributed, ask: do you actually need microservices? Distributed transactions add enormous complexity. Many successful companies run monoliths at impressive scale.
Amazon Prime Video Case Study (2023): Amazon's Prime Video team publicly shared that they moved from a microservices architecture back to a monolith for their video monitoring service. The microservices version was more expensive, harder to operate, and the distributed coordination overhead was unnecessary for their use case. The monolith was 90% cheaper and simpler to maintain.
The lesson: start with a monolith. Extract services only when you have a clear reason (independent scaling, team autonomy, different deployment cadences).
Saga Failure Handling: A Complete Example
Consider an e-commerce order saga with 4 steps:
Order Saga (Orchestration):
Step 1: Create Order → order_service.createOrder()
Step 2: Reserve Inventory → inventory_service.reserve()
Step 3: Charge Payment → payment_service.charge()
Step 4: Ship Order → shipping_service.schedule()
HAPPY PATH:
Step 1 ✓ → Step 2 ✓ → Step 3 ✓ → Step 4 ✓ → DONE!
FAILURE AT STEP 3 (payment declined):
Step 1 ✓ → Step 2 ✓ → Step 3 ✗ → COMPENSATE!
Compensate Step 2: inventory_service.unreserve()
Compensate Step 1: order_service.cancel()
→ Order marked as FAILED, user notified
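The orchestration above can be sketched as a generic saga runner. The step names and the action/compensation functions are hypothetical; the point is the shape of the control flow: run steps in order, and on the first failure run the compensations of the completed steps in reverse.

```javascript
// Sketch of an orchestrated saga runner. Each step pairs a forward
// action with its compensating transaction.
async function runSaga(steps, ctx) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.action(ctx);
      completed.push(step);
    } catch (err) {
      // Unwind: compensate completed steps in reverse order.
      for (const s of completed.reverse()) {
        await s.compensate(ctx);
      }
      return { status: "FAILED", failedAt: step.name };
    }
  }
  return { status: "COMPLETED" };
}
```

This mirrors the failure trace above: when payment (step 3) throws, inventory is unreserved first, then the order is cancelled, and the saga reports FAILED so the user can be notified.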
The Outbox Pattern in Detail
The outbox pattern solves the dual-write problem — how to atomically update a database AND publish an event:
-- Instead of this (DANGEROUS - not atomic):
UPDATE orders SET status = 'confirmed' WHERE id = 123;
kafka.publish('order.confirmed', {order_id: 123}); -- what if this fails?
-- Use this (SAFE - atomic):
BEGIN TRANSACTION;
UPDATE orders SET status = 'confirmed' WHERE id = 123;
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('order.confirmed', '{"order_id": 123}', NOW());
COMMIT;
-- Separate process (CDC or poller) reads outbox and publishes:
-- Debezium captures outbox inserts → publishes to Kafka
-- After successful publish, mark outbox row as processed
Used by: Shopify (order processing), Uber (trip events), Netflix (content updates).
Idempotency in Practice
// Client sends request with idempotency key:
POST /api/v1/payments
Headers: { "Idempotency-Key": "pay_abc123xyz" }
Body: { "amount": 50.00, "currency": "USD" }
// Server implementation:
async function processPayment(req) {
  const key = req.headers['idempotency-key'];
  // Check if we've seen this key before
  const existing = await redis.get(`idempotency:${key}`);
  if (existing) return JSON.parse(existing); // Return cached response
  // Process the payment
  const result = await stripe.charges.create({ /* amount, currency, ... */ });
  // Cache the result (TTL: 24 hours)
  await redis.setex(`idempotency:${key}`, 86400, JSON.stringify(result));
  return result;
}
When to Use Which Pattern
| Scenario | Best Pattern | Why |
|---|---|---|
| Single database, multiple tables | Database transaction (ACID) | No distribution needed |
| 2-3 services, strong consistency needed | 2PC (if latency is acceptable) | Simplest correct solution |
| Multiple services, eventual consistency OK | Saga (orchestrated) | Clear flow, good error handling |
| Event-driven architecture | Saga (choreographed) + Outbox | Loose coupling, scalable |
| High-throughput, multi-region | CRDTs | No coordination overhead |
Common Mistakes
- ❌ Using 2PC across microservices — the blocking and coupling make it impractical at scale. Use sagas instead.
- ❌ Forgetting compensating transactions — sagas without compensation logic leave partial state on failure. Every forward step needs a reverse step.
- ❌ Not making operations idempotent — in a retry-heavy distributed system, duplicate processing is inevitable. Always use idempotency keys.
- ❌ Going microservices too early — distributed transactions are hard. If you can avoid them with a monolith, do it.
- ❌ Dual writes without outbox — writing to a DB and publishing to Kafka separately is a recipe for inconsistency. Always use the outbox pattern.
> [!TIP]
> Key Takeaways:
> - 2PC: strong consistency but blocking, slow, and tightly coupled — avoid at scale.
> - Saga: sequence of local transactions + compensations. Choreography for loose coupling, orchestration for clarity.
> - Outbox pattern: atomic DB write + event publish by writing events to an outbox table. Use Debezium for CDC.
> - Idempotency keys: protect against duplicate processing (Stripe, AWS). Cache results in Redis with TTL.
> - CRDTs: conflict-free data structures that merge without coordination (counters, sets, registers).
> - Consider monolith-first. Microservices are a last resort, not a starting point.