> [!CAUTION]
> In a microservices architecture, a single slow or failing service can bring down your entire system. If Service A calls Service B and B is slow, A's threads pile up waiting. Soon A becomes slow too, causing its callers (Service C) to queue up. This cascading failure can propagate through the entire system in seconds. Circuit breakers and bulkheads prevent this.
## The Circuit Breaker Pattern
Inspired by electrical circuit breakers, this pattern wraps calls to an external service with a state machine that monitors failures:
```
                failure threshold exceeded
  ┌──────────┐ ─────────────────────────→ ┌─────────────┐
  │  CLOSED  │                            │    OPEN     │
  │ (normal) │                            │ (fail fast) │
  └──────────┘                            └─────────────┘
        ▲                                  after │   ▲
        │ success (reset)                timeout │   │ failure
        │                                        ▼   │
        │                                  ┌─────────────┐
        └───────────────────────────────── │  HALF-OPEN  │
                                           │ (test probe)│
                                           └─────────────┘
```
- CLOSED (normal): Requests pass through normally. Failures are counted. If failures exceed a threshold (e.g., 5 failures in 10 seconds), the circuit opens.
- OPEN (fail fast): All requests are immediately rejected with a fallback (cached data, a default value, or an error). No calls reach the failing service. After a timeout (e.g., 30 seconds), the circuit moves to half-open.
- HALF-OPEN (test probe): Allow a single test request through. If it succeeds → circuit closes (service recovered). If it fails → circuit opens again.
## The Bulkhead Pattern
Named after ship compartments that prevent a hull breach from sinking the whole vessel. In software, you isolate failures by giving each service or operation its own resource pool:
Without bulkhead:

```
All services share one thread pool (100 threads)
→ Service B goes slow, consumes 95 threads
→ Only 5 threads left for Services A, C, and D
→ Everything slows down
```

With bulkhead:

```
Service A: 30 threads (isolated)
Service B: 30 threads (isolated) → B goes slow, fills its 30
Service C: 20 threads (isolated) → unaffected
Service D: 20 threads (isolated) → unaffected
```
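To make the isolation concrete, here is a minimal bulkhead sketch using per-dependency thread pools from Python's standard library. The pool sizes mirror the illustration above; the service names and the `call_isolated` helper are illustrative, not any library's API:

```python
from concurrent.futures import ThreadPoolExecutor

# Each dependency gets its own isolated pool. A slow Service B can
# exhaust only its own 30 workers, never A's, C's, or D's.
pools = {
    "service_a": ThreadPoolExecutor(max_workers=30),
    "service_b": ThreadPoolExecutor(max_workers=30),
    "service_c": ThreadPoolExecutor(max_workers=20),
    "service_d": ThreadPoolExecutor(max_workers=20),
}

def call_isolated(service, func, *args):
    # Submitting to the service's own pool confines queuing to that pool.
    # Usage: call_isolated("service_b", slow_rpc, payload).result(timeout=2)
    return pools[service].submit(func, *args)
```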
## Retry Strategies
| Strategy | How | When |
|---|---|---|
| Immediate retry | Retry instantly | Transient network glitches |
| Fixed delay | Wait N seconds between retries | General purpose |
| Exponential backoff | 1s → 2s → 4s → 8s → ... | Overloaded services (give recovery time) |
| Exponential backoff + jitter | Backoff + random variation | Thundering herd prevention (many clients retrying together) |
> [!IMPORTANT]
> Always use jitter with exponential backoff. Without jitter, if 1,000 clients all fail at the same time, they will all retry at 1s, then 2s, then 4s, creating synchronized spikes. Adding random jitter (e.g., 1.0–1.5s, 2.0–3.0s) spreads retries out and prevents the thundering herd.
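One way to see the differences is to compute each strategy's delay for a given attempt. A minimal sketch; the strategy labels are our own, not a library API:

```python
import random

def retry_delay(strategy, attempt, base=1.0):
    """Seconds to wait before retry number `attempt` (0-indexed)."""
    if strategy == "immediate":
        return 0.0
    if strategy == "fixed":
        return base                        # N seconds every time
    if strategy == "backoff":
        return base * (2 ** attempt)       # 1s, 2s, 4s, 8s, ...
    if strategy == "backoff_jitter":
        d = base * (2 ** attempt)
        return random.uniform(d, d * 1.5)  # e.g. 1.0-1.5s, 2.0-3.0s
    raise ValueError(f"unknown strategy: {strategy}")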
## Real-World Usage
### Netflix Hystrix & resilience4j
Netflix was an early pioneer of the circuit breaker pattern. Their Hystrix library (now in maintenance mode, succeeded by resilience4j) wraps every inter-service call with a circuit breaker. At Netflix's scale (~250 million subscribers), a single failing microservice can trigger cascading failures across hundreds of services. Circuit breakers isolate failures and return fallback content (e.g., a cached movie recommendation list instead of a personalized one).
### Shopify Black Friday/Cyber Monday
Shopify handled $9.3 billion in sales over a single BFCM weekend. They use bulkhead isolation to ensure that one merchant's traffic spike doesn't affect other merchants. Each storefront has resource limits, and circuit breakers protect shared backend services from being overwhelmed.
## Implementation: Circuit Breaker in Code
```python
import time

# Circuit states
CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, fallback=None):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout      # seconds
        self.state = CLOSED
        self.last_failure_time = None
        # What to return while OPEN (cached data, default value, ...)
        self.fallback = fallback or (lambda: None)

    def call(self, func, *args):
        if self.state == OPEN:
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = HALF_OPEN                # let one probe through
            else:
                return self.fallback()                # fail fast!
        try:
            result = func(*args)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.failure_count >= self.failure_threshold:
                self.state = OPEN
            raise
        if self.state == HALF_OPEN:
            self.state = CLOSED                       # probe succeeded: recovered
        self.failure_count = 0
        return result
```
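A quick usage sketch; the flaky `fetch_recommendations` function, its failure rate, and the fallback value are made up for illustration:

```python
import random

breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30,
    fallback=lambda: {"recommendations": []},  # degraded default while OPEN
)

def fetch_recommendations():
    # Stand-in for a real dependency call; fails 80% of the time here.
    if random.random() < 0.8:
        raise RuntimeError("upstream timed out")
    return {"recommendations": ["movie-1", "movie-2"]}

for _ in range(20):
    try:
        print(breaker.call(fetch_recommendations))
    except RuntimeError:
        print("failure (circuit still CLOSED, counting toward threshold)")
```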
## Retry with Exponential Backoff and Jitter
```python
import random
import time

class TransientError(Exception):
    """Placeholder for whatever retryable error type your app defines."""

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise                                  # out of retries
            delay = base_delay * (2 ** attempt)        # 1s, 2s, 4s
            jitter = random.uniform(0, delay * 0.5)    # prevent thundering herd
            time.sleep(delay + jitter)
```
Without jitter, all 1,000 clients retry at t=1s, t=2s, t=4s (thundering herd!). With jitter, clients spread out across t=1.0–1.5s, t=2.0–3.0s, t=4.0–6.0s.
## Resilience Patterns Comparison
| Pattern | What It Prevents | When to Use |
|---|---|---|
| Circuit Breaker | Cascading failures from a failing dependency | Every inter-service call |
| Bulkhead | One slow dependency exhausting all resources | Thread pools, connection pools |
| Retry + Backoff | Transient failures (network blips) | Idempotent operations only |
| Timeout | Indefinite waiting on slow responses | Every external call |
| Fallback | Complete service unavailability | Non-critical features |
| Rate Limiter | Overloading your own service | Public APIs, shared resources |
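These patterns compose naturally. Below is a sketch wiring several together, assuming the `CircuitBreaker` and `retry_with_backoff` defined above plus the `requests` library; the internal URL is hypothetical:

```python
import requests  # assumed HTTP client for this sketch

breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30,
    fallback=lambda: {"items": []},             # Fallback: degraded response
)

def fetch_catalog():
    try:
        # Timeout: never wait indefinitely on a slow response.
        resp = requests.get("https://catalog.internal/items", timeout=2)
        resp.raise_for_status()
    except requests.RequestException as exc:
        raise TransientError(str(exc)) from exc  # map to the retryable type
    return resp.json()

def resilient_fetch():
    # Retry wraps the breaker: once the circuit is OPEN, call() returns the
    # fallback immediately instead of hammering the failing service.
    return retry_with_backoff(lambda: breaker.call(fetch_catalog),
                              max_retries=3, base_delay=1)
```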
## Common Mistakes
- ❌ Retrying without backoff — hammering a failing service with immediate retries makes the problem worse. Always use exponential backoff with jitter.
- ❌ No fallback in OPEN state — returning a raw 500 error to users when the circuit trips. Always return cached data or a graceful degraded experience.
- ❌ Not logging circuit state changes — circuit breaker state transitions are critical operational signals. Alert on OPEN transitions.
- ❌ Retrying non-idempotent operations — retrying a payment charge can double-charge the customer. Only retry safe-to-repeat operations (see the idempotency-key sketch after this list).
- ❌ Setting recovery timeout too short — if the HALF-OPEN probe comes too quickly, the downstream hasn't recovered yet. Start with 30+ seconds.
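On the non-idempotent point: payment APIs commonly accept an idempotency key so a retried request is deduplicated server-side. A sketch, where `client.create_charge` and its `idempotency_key` parameter are hypothetical stand-ins for your payment client:

```python
import uuid

def charge_with_retry(client, amount_cents, currency="usd"):
    # One key per logical charge, reused across every retry attempt,
    # so the provider can deduplicate if a response was lost in transit.
    key = str(uuid.uuid4())
    def attempt():
        return client.create_charge(
            amount=amount_cents,
            currency=currency,
            idempotency_key=key,  # hypothetical parameter name
        )
    return retry_with_backoff(attempt, max_retries=3, base_delay=1)
```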
> [!TIP]
> Key Takeaways:
> - Circuit breaker: CLOSED → OPEN (fail fast) → HALF-OPEN (test) → CLOSED. Prevents cascading failures.
> - Bulkhead: isolate resources per service so one failure cannot drain the whole system.
> - Always use exponential backoff with jitter for retries. Never retry without backoff.
> - Combine patterns: Circuit Breaker + Retry + Timeout + Bulkhead + Fallback = resilient service.
> - Netflix (Hystrix/resilience4j), Shopify, and the AWS SDK all use these patterns extensively.