[!IMPORTANT] Downstream latency questions test whether you can stop one slow dependency from triggering a system-wide cascading failure.
🧭 At a Glance
| Technique | Why It Matters |
|---|---|
| Timeouts | Stop threads from waiting forever. |
| Circuit breaker | Stop calling a dependency that is already unhealthy. |
| Fallback | Return a partial or degraded response when possible. |
| Bulkhead | Isolate thread pools and connection pools per dependency. |
| Careful retries | Retry only with limits, backoff, and jitter. |
📌 Real Interview Prompt
Question: If one downstream service is experiencing high latency, how would you reduce the impact on your service and the overall system?
✅ Short Answer
I would protect my service with strict timeouts, a circuit breaker, fallback responses, caching, async processing where possible, bulkhead isolation, and limited retries with exponential backoff and jitter. The goal is to fail fast, degrade gracefully, and prevent cascading failure.
🔌 Circuit Breaker States
- CLOSED -> calls allowed
- OPEN -> calls blocked, fallback returned
- HALF_OPEN -> limited trial calls allowed
💬 Expandable Q/A
How does a circuit breaker work?
In the CLOSED state, calls go to the downstream service. If failures or timeouts cross a threshold, the breaker moves to OPEN and returns the fallback immediately without calling the dependency. After a cooldown, it moves to HALF_OPEN and allows a few test calls. If they succeed, it returns to CLOSED; otherwise, it goes back to OPEN.
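
The state transitions can be sketched in a few lines of Python. This is a minimal illustration, not a production breaker: the class name, the `failure_threshold` and `cooldown_seconds` parameters, and the injectable `clock` are all assumptions made for the example. It counts consecutive failures, closes again after a single successful trial call (a real breaker would usually require several), and is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreaker:
    """Illustrative breaker; names and thresholds are assumptions."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the cooldown can be tested
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return fallback()  # fail fast, do not touch the dependency
            self.state = State.HALF_OPEN  # cooldown elapsed: allow a trial call
        try:
            result = func()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would wrap each downstream request as `breaker.call(fetch, fallback)`, where `fetch` raises on failure or timeout and `fallback` returns the degraded response.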
Why are timeouts necessary?
Without timeouts, slow downstream calls can consume all request threads and connection pools. Timeouts allow the caller to fail fast and keep capacity for healthy operations.
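
As a rough illustration of failing fast, the sketch below bounds a call with `asyncio.wait_for` and returns a fallback value when the deadline passes. The helper `call_with_timeout` and the two simulated dependencies are hypothetical names for this example. In a thread-based service the same idea is usually expressed through the connect and read timeouts of the HTTP or database client.

```python
import asyncio


async def call_with_timeout(make_call, timeout_seconds, fallback):
    """Await make_call(), but give up after timeout_seconds and return fallback."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call, so it stops holding capacity.
        return fallback


async def slow_dependency():
    await asyncio.sleep(1.0)  # simulates a slow downstream service
    return "fresh data"


async def fast_dependency():
    return "fresh data"


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout(slow_dependency, 0.05, "cached data")))
    print(asyncio.run(call_with_timeout(fast_dependency, 0.05, "cached data")))
```

The slow call is abandoned after 50 ms and the caller gets the fallback, while the fast call returns its real result.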
When should retries be avoided?
Avoid aggressive retries when the downstream is overloaded. Retries can multiply traffic and make the incident worse. Use small retry counts with exponential backoff and jitter, and retry only idempotent operations.
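
A bounded retry with exponential backoff and jitter might look like the sketch below. The function name, default values, and the injectable `sleep` and `rng` parameters are assumptions for illustration. It uses the "full jitter" variant, where each delay is a random fraction of an exponentially growing cap. For brevity it retries on any exception; a real caller would retry only transient errors and only idempotent operations.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() up to max_attempts times with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential cap: base_delay * 2^attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the cap so that many
            # clients do not retry at the same instant.
            sleep(rng() * cap)
```

With the defaults, the caller makes at most three attempts and waits at most 0.1 s and then 0.2 s between them, so a struggling dependency sees only a small, spread-out increase in traffic.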
What is bulkhead isolation?
Bulkheads isolate resources per dependency. For example, payment calls and recommendation calls should use separate thread pools, so that a spike in recommendation latency cannot exhaust the threads that payment calls need.
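
One simple way to picture a bulkhead is a separate `ThreadPoolExecutor` per dependency. In the sketch below, the pool names, pool sizes, and the simulated calls are illustrative assumptions. The recommendation pool is deliberately saturated with stuck calls, and the payment call still completes because it runs on its own threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One small pool per dependency (names and sizes are illustrative assumptions).
payment_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="payment")
recommendation_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="recs")

release_recommendations = threading.Event()


def slow_recommendation_call():
    # Simulates a latency spike: blocks until released (or 5 s at most).
    release_recommendations.wait(timeout=5)
    return "recommendations"


def payment_call():
    return "payment ok"


# Saturate the recommendation pool: 4 stuck calls for 2 worker threads.
stuck = [recommendation_pool.submit(slow_recommendation_call) for _ in range(4)]

# Payment work still runs promptly, because it has its own threads.
payment_result = payment_pool.submit(payment_call).result(timeout=1)

release_recommendations.set()  # let the stuck calls finish
recommendation_results = [f.result(timeout=5) for f in stuck]

payment_pool.shutdown()
recommendation_pool.shutdown()
```

Had both dependencies shared a single two-thread pool, the payment call would have queued behind the stuck recommendation calls.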
⚠️ Common Mistakes
- No timeout on downstream calls.
- Retrying too aggressively.
- No fallback for non-critical dependencies.
- One shared thread pool for all dependencies.
- No metrics for timeout rate, circuit state, or dependency latency.
📝 Final Summary
When a downstream service becomes slow, protect your own service first. Use timeouts, a circuit breaker, fallbacks, caching, async queues, bulkhead isolation, and careful retries. The best interview phrase is: fail fast, degrade gracefully, and prevent cascading failure.