> [!CAUTION]
> In a microservices architecture, a single slow or failing service can bring down your entire system. If Service A calls Service B and B is slow, A's threads pile up waiting. Soon A becomes slow too, causing its callers (Service C) to queue up. This cascading failure can propagate through the entire system in seconds. Circuit breakers and bulkheads prevent this.
## The Circuit Breaker Pattern
Inspired by electrical circuit breakers, this pattern wraps calls to an external service with a state machine that monitors failures:
```
                failure threshold exceeded
  ┌──────────┐ ─────────────────────────→ ┌─────────────┐
  │  CLOSED  │                            │    OPEN     │
  │ (normal) │                            │ (fail fast) │
  └──────────┘                            └─────────────┘
        ▲                                  after │   ▲
        │ success (reset)                timeout │   │ failure
        │                                        ▼   │
        │                                  ┌─────────────┐
        └───────────────────────────────── │  HALF-OPEN  │
                                           │ (test probe)│
                                           └─────────────┘
```
- CLOSED (normal): Requests pass through normally. Failures are counted. If failures exceed a threshold (e.g., 5 failures in 10 seconds), the circuit opens.
- OPEN (fail fast): All requests are immediately rejected with a fallback (cached data, a default value, or an error). No calls reach the failing service. After a timeout (e.g., 30 seconds), the circuit moves to half-open.
- HALF-OPEN (test probe): Allow a single test request through. If it succeeds → circuit closes (service recovered). If it fails → circuit opens again.
## The Bulkhead Pattern
Named after ship compartments that prevent a hull breach from sinking the whole vessel. In software, you isolate failures by giving each service or operation its own resource pool:
Without bulkhead:

```
All services share one thread pool (100 threads)
→ Service B goes slow, consumes 95 threads
→ Only 5 threads left for Services A, C, and D
→ Everything slows down
```

With bulkhead:

```
Service A: 30 threads (isolated)
Service B: 30 threads (isolated) → B goes slow, fills its 30
Service C: 20 threads (isolated) → unaffected
Service D: 20 threads (isolated) → unaffected
```
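To make the isolation concrete, here is a minimal bulkhead sketch using per-dependency thread pools from Python's standard library. The pool sizes mirror the illustration above; the service names and the `call_isolated` helper are illustrative, not any library's API:

```python
from concurrent.futures import ThreadPoolExecutor

# Each dependency gets its own isolated pool. A slow Service B can
# exhaust only its own 30 workers, never A's, C's, or D's.
pools = {
    "service_a": ThreadPoolExecutor(max_workers=30),
    "service_b": ThreadPoolExecutor(max_workers=30),
    "service_c": ThreadPoolExecutor(max_workers=20),
    "service_d": ThreadPoolExecutor(max_workers=20),
}

def call_isolated(service, func, *args):
    # Submitting to the service's own pool confines queuing to that pool.
    # Usage: call_isolated("service_b", slow_rpc, payload).result(timeout=2)
    return pools[service].submit(func, *args)
```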
## Retry Strategies
| Strategy | How | When |
|---|---|---|
| Immediate retry | Retry instantly | Transient network glitches |
| Fixed delay | Wait N seconds between retries | General purpose |
| Exponential backoff | 1s → 2s → 4s → 8s → ... | Overloaded services (give recovery time) |
| Exponential backoff + jitter | Backoff + random variation | Thundering herd prevention (many clients retrying together) |
> [!IMPORTANT]
> Always use jitter with exponential backoff. Without jitter, if 1,000 clients all fail at the same time, they will all retry at 1s, then 2s, then 4s, creating synchronized spikes. Adding random jitter (e.g., 1.0–1.5s, 2.0–3.0s) spreads retries out and prevents the thundering herd.
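One way to see the differences is to compute each strategy's delay for a given attempt. A minimal sketch; the strategy labels are our own, not a library API:

```python
import random

def retry_delay(strategy, attempt, base=1.0):
    """Seconds to wait before retry number `attempt` (0-indexed)."""
    if strategy == "immediate":
        return 0.0
    if strategy == "fixed":
        return base                        # N seconds every time
    if strategy == "backoff":
        return base * (2 ** attempt)       # 1s, 2s, 4s, 8s, ...
    if strategy == "backoff_jitter":
        d = base * (2 ** attempt)
        return random.uniform(d, d * 1.5)  # e.g. 1.0-1.5s, 2.0-3.0s
    raise ValueError(f"unknown strategy: {strategy}")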
## Real-World Usage
### Netflix Hystrix & resilience4j
Netflix was an early pioneer of the circuit breaker pattern. Their Hystrix library (now in maintenance mode, succeeded by resilience4j) wraps every inter-service call with a circuit breaker. At Netflix's scale (~250 million subscribers), a single failing microservice can trigger cascading failures across hundreds of services. Circuit breakers isolate failures and return fallback content (e.g., a cached movie recommendation list instead of a personalized one).
### Shopify Black Friday/Cyber Monday
Shopify handled $9.3 billion in sales over a single BFCM weekend. They use bulkhead isolation to ensure that one merchant's traffic spike doesn't affect other merchants. Each storefront has resource limits, and circuit breakers protect shared backend services from being overwhelmed.
## Implementation: Circuit Breaker in Code
```python
import time

# Circuit states
CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, fallback=None):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout      # seconds
        self.state = CLOSED
        self.last_failure_time = None
        # What to return while OPEN (cached data, default value, ...)
        self.fallback = fallback or (lambda: None)

    def call(self, func, *args):
        if self.state == OPEN:
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = HALF_OPEN                # let one probe through
            else:
                return self.fallback()                # fail fast!
        try:
            result = func(*args)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.failure_count >= self.failure_threshold:
                self.state = OPEN
            raise
        if self.state == HALF_OPEN:
            self.state = CLOSED                       # probe succeeded: recovered
        self.failure_count = 0
        return result
```
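A quick usage sketch; the flaky `fetch_recommendations` function, its failure rate, and the fallback value are made up for illustration:

```python
import random

breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30,
    fallback=lambda: {"recommendations": []},  # degraded default while OPEN
)

def fetch_recommendations():
    # Stand-in for a real dependency call; fails 80% of the time here.
    if random.random() < 0.8:
        raise RuntimeError("upstream timed out")
    return {"recommendations": ["movie-1", "movie-2"]}

for _ in range(20):
    try:
        print(breaker.call(fetch_recommendations))
    except RuntimeError:
        print("failure (circuit still CLOSED, counting toward threshold)")
```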
## Retry with Exponential Backoff and Jitter
```python
import random
import time

class TransientError(Exception):
    """Placeholder for whatever retryable error type your app defines."""

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise                                  # out of retries
            delay = base_delay * (2 ** attempt)        # 1s, 2s, 4s
            jitter = random.uniform(0, delay * 0.5)    # prevent thundering herd
            time.sleep(delay + jitter)
```
Without jitter, all 1,000 clients retry at t=1s, t=2s, t=4s (thundering herd!). With jitter, clients spread out across t=1.0–1.5s, t=2.0–3.0s, t=4.0–6.0s.
## Resilience Patterns Comparison
| Pattern | What It Prevents | When to Use |
|---|---|---|
| Circuit Breaker | Cascading failures from a failing dependency | Every inter-service call |
| Bulkhead | One slow dependency exhausting all resources | Thread pools, connection pools |
| Retry + Backoff | Transient failures (network blips) | Idempotent operations only |
| Timeout | Indefinite waiting on slow responses | Every external call |
| Fallback | Complete service unavailability | Non-critical features |
| Rate Limiter | Overloading your own service | Public APIs, shared resources |
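These patterns compose naturally. Below is a sketch wiring several together, assuming the `CircuitBreaker` and `retry_with_backoff` defined above plus the `requests` library; the internal URL is hypothetical:

```python
import requests  # assumed HTTP client for this sketch

breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30,
    fallback=lambda: {"items": []},             # Fallback: degraded response
)

def fetch_catalog():
    try:
        # Timeout: never wait indefinitely on a slow response.
        resp = requests.get("https://catalog.internal/items", timeout=2)
        resp.raise_for_status()
    except requests.RequestException as exc:
        raise TransientError(str(exc)) from exc  # map to the retryable type
    return resp.json()

def resilient_fetch():
    # Retry wraps the breaker: once the circuit is OPEN, call() returns the
    # fallback immediately instead of hammering the failing service.
    return retry_with_backoff(lambda: breaker.call(fetch_catalog),
                              max_retries=3, base_delay=1)
```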
## Common Mistakes
- ❌ Retrying without backoff — hammering a failing service with immediate retries makes the problem worse. Always use exponential backoff with jitter.
- ❌ No fallback in OPEN state — returning a raw 500 error to users when the circuit trips. Always return cached data or a graceful degraded experience.
- ❌ Not logging circuit state changes — circuit breaker state transitions are critical operational signals. Alert on OPEN transitions.
- ❌ Retrying non-idempotent operations — retrying a payment charge can double-charge the customer. Only retry safe-to-repeat operations (see the idempotency-key sketch after this list).
- ❌ Setting recovery timeout too short — if the HALF-OPEN probe comes too quickly, the downstream hasn't recovered yet. Start with 30+ seconds.
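On the non-idempotent point: payment APIs commonly accept an idempotency key so a retried request is deduplicated server-side. A sketch, where `client.create_charge` and its `idempotency_key` parameter are hypothetical stand-ins for your payment client:

```python
import uuid

def charge_with_retry(client, amount_cents, currency="usd"):
    # One key per logical charge, reused across every retry attempt,
    # so the provider can deduplicate if a response was lost in transit.
    key = str(uuid.uuid4())
    def attempt():
        return client.create_charge(
            amount=amount_cents,
            currency=currency,
            idempotency_key=key,  # hypothetical parameter name
        )
    return retry_with_backoff(attempt, max_retries=3, base_delay=1)
```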
> [!TIP]
> Key Takeaways:
> - Circuit breaker: CLOSED → OPEN (fail fast) → HALF-OPEN (test) → CLOSED. Prevents cascading failures.
> - Bulkhead: isolate resources per service so one failure cannot drain the whole system.
> - Always use exponential backoff with jitter for retries. Never retry without backoff.
> - Combine patterns: Circuit Breaker + Retry + Timeout + Bulkhead + Fallback = resilient service.
> - Netflix (Hystrix/resilience4j), Shopify, and the AWS SDK all use these patterns extensively.