25. Circuit Breakers & Bulkhead Pattern

How to prevent cascading failures in distributed systems using circuit breakers, bulkheads, and retry strategies.

Mar 5, 2026

[!CAUTION] In a microservices architecture, a single slow or failing service can bring down your entire system. If Service A calls Service B and B is slow, A's threads pile up waiting on B. Soon A becomes slow too, causing its own callers (such as Service C) to queue up. This cascading failure can propagate through your entire system in seconds. Circuit breakers and bulkheads prevent this.

The Circuit Breaker Pattern

Inspired by electrical circuit breakers, this pattern wraps calls to an external service with a state machine that monitors failures:

    ┌─────────┐   Failure threshold      ┌────────────┐
    │ CLOSED  │ ───────────────────────→ │    OPEN    │
    │(normal) │                          │(fail fast) │
    └─────────┘                          └────────────┘
         ▲                                  │       ▲
         │ Success            After timeout │       │ Failure
         │ (reset)                          ▼       │
         │                            ┌──────────────┐
         └──────────────────────────── │  HALF-OPEN   │
                                       │ (test probe) │
                                       └──────────────┘
  • CLOSED (normal): Requests pass through normally. Failures are counted. If failures exceed a threshold (e.g., 5 failures in 10 seconds), the circuit opens.
  • OPEN (fail fast): All requests are immediately rejected with a fallback (cached data, default value, or error). No calls to the failing service. After a timeout (e.g., 30 seconds), move to half-open.
  • HALF-OPEN (test probe): Allow a single test request through. If it succeeds → circuit closes (service recovered). If it fails → circuit opens again.

The Bulkhead Pattern

Named after ship compartments that prevent a hull breach from sinking the whole vessel. In software, you isolate failures by giving each service or operation its own resource pool:

Without Bulkhead:
  All services share one thread pool (100 threads)
  → Service B goes slow, consumes 95 threads
  → Only 5 threads left for Service A, C, D
  → Everything slows down

With Bulkhead:
  Service A: 30 threads (isolated)
  Service B: 30 threads (isolated) → B goes slow, fills its 30
  Service C: 20 threads (isolated) → Unaffected
  Service D: 20 threads (isolated) → Unaffected
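
As a rough sketch of how this looks in code, here is one way to implement per-service bulkheads in Python using a dedicated thread pool per dependency. The pool sizes and service names mirror the example above; the call_service helper is illustrative, not a standard API:

from concurrent.futures import ThreadPoolExecutor

# One isolated pool per downstream dependency (sizes mirror the example above)
pools = {
    "service_a": ThreadPoolExecutor(max_workers=30),
    "service_b": ThreadPoolExecutor(max_workers=30),
    "service_c": ThreadPoolExecutor(max_workers=20),
    "service_d": ThreadPoolExecutor(max_workers=20),
}

def call_service(name, func, *args):
    # Submit into the service's own compartment. If service_b hangs,
    # only its 30 threads block; the other pools stay healthy.
    future = pools[name].submit(func, *args)
    return future.result(timeout=2)  # Pair the bulkhead with a timeout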

Retry Strategies

| Strategy | How | When |
| --- | --- | --- |
| Immediate retry | Retry instantly | Transient network glitches |
| Fixed delay | Wait N seconds between retries | General purpose |
| Exponential backoff | 1s → 2s → 4s → 8s → ... | Overloaded services (give recovery time) |
| Exponential backoff + jitter | Backoff + random variation | Thundering herd prevention (many clients retrying together) |

[!IMPORTANT] Always use jitter with exponential backoff. Without jitter, if 1,000 clients all fail at the same time, they will all retry at 1s, then 2s, then 4s — creating synchronized spikes. Adding random jitter (e.g., 1.0–1.5s, 2.0–3.0s) spreads retries out and prevents the thundering herd.

Real-World Usage

Netflix Hystrix & resilience4j

Netflix was an early pioneer of the circuit breaker pattern. Their Hystrix library (now in maintenance mode, succeeded by resilience4j) wraps every inter-service call with a circuit breaker. At Netflix's scale (~250 million subscribers), a single failing microservice can trigger cascading failures across hundreds of services. Circuit breakers isolate failures and return fallback content (e.g., a cached movie recommendation list instead of a personalized one).

Shopify Black Friday/Cyber Monday

Shopify handles $9.3 billion in sales over the BFCM weekend. They use bulkhead isolation to ensure that one merchant's traffic spike doesn't affect other merchants. Each storefront has resource limits, and circuit breakers protect shared backend services from being overwhelmed.

Implementation: Circuit Breaker in Code

import time

# Circuit states
CLOSED = "CLOSED"        # Normal operation
OPEN = "OPEN"            # Failing fast, no calls to the dependency
HALF_OPEN = "HALF_OPEN"  # Probing to see if the dependency recovered

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds
        self.state = CLOSED
        self.last_failure_time = None

    def fallback(self):
        # Return cached data, a default value, or a degraded response here
        return None

    def call(self, func, *args):
        if self.state == OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = HALF_OPEN  # Allow one probe request through
            else:
                return self.fallback()  # Fail fast!

        try:
            result = func(*args)
            if self.state == HALF_OPEN:
                self.state = CLOSED     # Service recovered!
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = OPEN       # Trip the circuit
            raise
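
To use it, wrap each downstream call in breaker.call. A minimal usage sketch, assuming a hypothetical recommendations endpoint (the requests call and URL are illustrative, not part of the original example):

import requests

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def fetch_recommendations(user_id):
    # Hypothetical downstream call; the URL is illustrative
    resp = requests.get(f"https://recs.internal/users/{user_id}", timeout=1)
    resp.raise_for_status()
    return resp.json()

recs = breaker.call(fetch_recommendations, "user-42")
if recs is None:
    recs = []  # Circuit is OPEN: serve a degraded (empty) list instead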

Retry with Exponential Backoff and Jitter

import random
import time

class TransientError(Exception):
    pass  # Placeholder for whatever retryable error your client raises

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise                                # Out of retries
            delay = base_delay * (2 ** attempt)      # 1s, 2s, 4s
            jitter = random.uniform(0, delay * 0.5)  # Prevent thundering herd
            time.sleep(delay + jitter)

Without jitter: all 1000 clients retry at t=1s, t=2s, t=4s (thundering herd!)
With jitter:    clients spread out at t=0.8s, t=1.2s, t=1.4s...

Resilience Patterns Comparison

| Pattern | What It Prevents | When to Use |
| --- | --- | --- |
| Circuit Breaker | Cascading failures from a down dependency | Every inter-service call |
| Bulkhead | One slow dependency exhausting all resources | Thread pools, connection pools |
| Retry + Backoff | Transient failures (network blips) | Idempotent operations only |
| Timeout | Indefinite waiting on slow responses | Every external call |
| Fallback | Complete service unavailability | Non-critical features |
| Rate Limiter | Overloading your own service | Public APIs, shared resources |
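
These patterns are designed to be combined. A minimal composition sketch, reusing the CircuitBreaker and retry_with_backoff defined above (the layering order shown, breaker outside retry, is a common convention rather than the only valid one):

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def resilient_call(func, *args):
    # Outermost: circuit breaker (fails fast while the dependency is down)
    # Middle:    retry with backoff + jitter (absorbs transient blips)
    # Innermost: func itself should enforce a per-request timeout
    return breaker.call(lambda: retry_with_backoff(lambda: func(*args)))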

Common Mistakes

  • ❌ Retrying without backoff — hammering a failing service with immediate retries makes the problem worse. Always use exponential backoff with jitter.
  • ❌ No fallback in OPEN state — returning a raw 500 error to users when the circuit trips. Always return cached data or a graceful degraded experience.
  • ❌ Not logging circuit state changes — circuit breaker state transitions are critical operational signals. Alert on OPEN transitions.
  • ❌ Retrying non-idempotent operations — retrying a payment charge can double-charge the customer. Only retry safe-to-repeat operations.
  • ❌ Setting recovery timeout too short — if the HALF-OPEN probe comes too quickly, the downstream hasn't recovered yet. Start with 30+ seconds.

[!TIP] Key Takeaways:
• Circuit breaker: CLOSED → OPEN (fail fast) → HALF-OPEN (test) → CLOSED. Prevents cascading failures.
• Bulkhead: isolate resources per service so one failure cannot drain the whole system.
• Always use exponential backoff with jitter for retries. Never retry without backoff.
• Combine patterns: Circuit Breaker + Retry + Timeout + Bulkhead + Fallback = resilient service.
• Netflix (Hystrix/resilience4j), Shopify, AWS SDK all use these patterns extensively.
