28. Observability (Tracing, Logging, SLOs)

The three pillars of observability, deployment strategies, and how to know when your system is healthy.


[!NOTE] Building a distributed system is hard. Running a distributed system is harder. Observability is the ability to understand what is happening inside your system by looking at its external outputs: logs, metrics, and traces. Without observability, debugging a production issue in a microservices architecture is like finding a needle in a haystack — blindfolded.

The Three Pillars of Observability

| Pillar | What It Records | Best For | Tools |
| --- | --- | --- | --- |
| Logs | Discrete events with timestamps | Debugging specific errors, audit trails | ELK Stack (Elasticsearch + Logstash + Kibana), Fluentd, Loki |
| Metrics | Numerical measurements over time | Dashboards, alerting, trend analysis | Prometheus + Grafana, Datadog, CloudWatch |
| Traces | End-to-end journey of a request across services | Finding bottlenecks in distributed systems | Jaeger, Zipkin, OpenTelemetry, Honeycomb |

Distributed Tracing

When a user clicks "Buy Now," the request might flow through 10+ services: API gateway → order service → inventory → payment → notification → shipping. If the total response time is 5 seconds, which service is slow?

Distributed tracing propagates a Trace ID through every service in the request chain. Each service reports its span (start time, end time, metadata):

Trace ID: abc-123

[API Gateway]     |=====|                              (120ms)
  [Order Service]   |=========|                        (200ms)
    [Inventory]       |===|                            (80ms)
    [Payment]             |================|           (450ms) ← BOTTLENECK!
  [Notification]                           |==|        (50ms)

Total: 900ms. Payment service is the bottleneck.

How it works: The first service generates a Trace ID and passes it in HTTP headers (traceparent or X-B3-TraceId). Each downstream service creates a span, links it to the Trace ID, and continues passing it along. All spans are collected by a trace collector (Jaeger, Zipkin) and assembled into a trace view.
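
To make the mechanics concrete, here is a minimal sketch of that propagation in an Express-style Node.js service. The header format follows the W3C traceparent convention, but the service name and the reportSpan exporter are hypothetical stand-ins for a real tracing client such as OpenTelemetry:

const crypto = require("crypto");

// Read the incoming trace context, or start a new trace if this is the first hop.
// traceparent format (W3C Trace Context): "00-<trace-id>-<parent-span-id>-<flags>"
function extractTraceContext(req) {
  const header = req.headers["traceparent"];
  if (header) {
    const [, traceId, parentSpanId] = header.split("-");
    return { traceId, parentSpanId };
  }
  return { traceId: crypto.randomBytes(16).toString("hex"), parentSpanId: null };
}

// Express-style middleware: open a span for this service, forward the context
// downstream, and report the span when the response finishes.
function tracingMiddleware(req, res, next) {
  const { traceId, parentSpanId } = extractTraceContext(req);
  const span = {
    traceId,
    parentSpanId,
    spanId: crypto.randomBytes(8).toString("hex"),
    service: "order-service",   // hypothetical service name
    start: Date.now(),
  };

  // Any downstream HTTP call made while handling this request must forward
  // the same trace ID, with this span as the new parent.
  req.outgoingTraceHeaders = { traceparent: `00-${traceId}-${span.spanId}-01` };

  res.on("finish", () => {
    span.end = Date.now();
    reportSpan(span);           // stand-in for exporting to Jaeger/Zipkin
  });
  next();
}

function reportSpan(span) {
  console.log(JSON.stringify(span));
}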

SLIs, SLOs, and SLAs

| Term | What It Is | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | A measurable metric | p99 latency = 200ms, availability = 99.95% |
| SLO (Service Level Objective) | A target value for an SLI | "p99 latency must be under 300ms" |
| SLA (Service Level Agreement) | A contract with penalties | "99.9% uptime or we refund credits" |

Error Budgets: If your SLO is 99.9% availability, you have a 0.1% error budget — approximately 43 minutes of downtime per month. Error budgets are spent on deployments, experiments, and incidents. When the budget is exhausted, you freeze feature releases and focus on reliability.
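
The budget arithmetic is simple enough to compute directly; a small sketch, assuming a 30-day month of 43,200 minutes:

// Downtime allowed by an SLO, assuming a 30-day month (43,200 minutes).
function errorBudgetMinutes(sloPercent, minutesInMonth = 30 * 24 * 60) {
  return ((100 - sloPercent) / 100) * minutesInMonth;
}

// Freeze feature releases once recorded downtime has consumed the budget.
function shouldFreezeDeployments(downtimeMinutesSoFar, sloPercent) {
  return downtimeMinutesSoFar >= errorBudgetMinutes(sloPercent);
}

console.log(errorBudgetMinutes(99.9));            // 43.2 minutes per month
console.log(errorBudgetMinutes(99.99));           // 4.32 minutes (ten times stricter)
console.log(shouldFreezeDeployments(40, 99.9));   // false: 3.2 minutes of budget left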

The RED Method

A simple framework for monitoring request-driven services:

  • Rate: How many requests per second?
  • Errors: How many of those requests are failing?
  • Duration: How long do the requests take? (p50, p95, p99)

If you can only have three metrics per service, use RED. It covers the essential health indicators.
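
As a rough sketch of what RED instrumentation looks like in application code (in practice you would use a metrics client such as prom-client or a vendor agent; the in-memory counters here are illustrative only):

// In-memory RED counters for one service: Rate, Errors, Duration.
const red = { requests: 0, errors: 0, durationsMs: [] };

// Wrap a request handler so every call is counted and timed.
async function withRedMetrics(handler, req, res) {
  const start = Date.now();
  red.requests += 1;
  try {
    return await handler(req, res);
  } catch (err) {
    red.errors += 1;
    throw err;
  } finally {
    red.durationsMs.push(Date.now() - start);
  }
}

// Percentile over the recorded durations (p50, p95, p99).
function percentile(p) {
  const sorted = [...red.durationsMs].sort((a, b) => a - b);
  if (sorted.length === 0) return 0;
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[index];
}

// Flush a RED snapshot once a minute, then reset the window.
setInterval(() => {
  console.log({
    rate_per_sec: red.requests / 60,
    error_ratio: red.requests === 0 ? 0 : red.errors / red.requests,
    p50_ms: percentile(50),
    p99_ms: percentile(99),
  });
  red.requests = 0;
  red.errors = 0;
  red.durationsMs = [];
}, 60 * 1000);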

Deployment Strategies

Observability is critical for safe deployments. These strategies progressively roll out changes while monitoring for regressions:

| Strategy | How | Rollback Speed | Risk |
| --- | --- | --- | --- |
| Rolling Update | Replace instances one at a time | Medium | Low — partial rollout |
| Blue-Green | Run two identical environments; switch traffic from Blue (old) to Green (new) at once | Fast (switch back) | Medium — full traffic on new |
| Canary | Route 1–5% of traffic to new version; monitor; gradually increase | Fast (route back to old) | Low — only a fraction of users affected |
| Feature Flags | Deploy the code but toggle features on/off at runtime | Instant (toggle off) | Very low |

[!TIP] Canary + observability is the gold standard. Deploy to 1% of traffic, watch your RED metrics and error budget for 15 minutes, then ramp to 10%, 50%, 100%. If errors spike at any stage, roll back instantly. Netflix, Google, and Amazon all use this approach.
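
A sketch of that control loop, where setTrafficPercent, getCanaryErrorRate, and rollback are hypothetical hooks into your load balancer and monitoring system:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Ramp traffic to the new version in stages, checking RED metrics at each step.
async function canaryRollout({ setTrafficPercent, getCanaryErrorRate, rollback }) {
  const stages = [1, 10, 50, 100];              // percent of traffic on the canary
  const observationWindowMs = 15 * 60 * 1000;   // watch metrics for 15 minutes
  const maxErrorRatio = 0.01;                   // assumed SLO threshold: 1% errors

  for (const percent of stages) {
    await setTrafficPercent(percent);           // e.g. adjust load balancer weights
    await sleep(observationWindowMs);           // let RED metrics accumulate

    const errorRatio = await getCanaryErrorRate();
    if (errorRatio > maxErrorRatio) {
      await rollback();                         // route all traffic back to the old version
      throw new Error(`Canary aborted at ${percent}%: error ratio ${errorRatio}`);
    }
  }
  console.log("Canary promoted to 100% of traffic");
}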

Real-World Usage

Google SRE

Google's Site Reliability Engineering team pioneered many observability concepts. They maintain error budgets for every service. When a team exhausts their error budget, they halt feature development and focus entirely on reliability. This creates a natural balance between velocity and stability.

Jaeger (Uber)

Uber created Jaeger, an open-source distributed tracing system. With 4,000+ microservices, Uber needed tracing to debug latency issues. Jaeger propagates trace context through every service and provides a UI to visualize request flows and identify slow services.

Structured Logging Best Practices

// BAD: Unstructured log
console.log("User 123 placed order 456 for $50.00");

// GOOD: Structured JSON log
logger.info({
  event: "order.placed",
  user_id: "123",
  order_id: "456",
  amount: 50.00,
  currency: "USD",
  trace_id: "abc-def-123",  // Links log to distributed trace
  span_id: "span-789"
});

// Now you can query: "Show all orders > $100 that failed"
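
For example, in Node.js a structured logger such as pino emits exactly this kind of JSON, one object per line; the field names below mirror the hypothetical ones above:

const pino = require("pino");
const logger = pino();   // writes one JSON object per line to stdout

logger.info(
  {
    event: "order.placed",
    user_id: "123",
    order_id: "456",
    amount: 50.0,
    currency: "USD",
    trace_id: "abc-def-123",
    span_id: "span-789",
  },
  "order placed"
);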

Distributed Tracing: Waterfall View

Trace ID: abc-def-123
API Gateway     [||||||||||||||||||||||||||||]  350ms total
  Auth Service    [||||]  15ms
  Order Service         [||||||||||||||||]  200ms  <-- BOTTLENECK
    DB Query                [||||||||||]  150ms    <-- Slow query!
    Cache Write                      [||]  5ms
  Payment Service                           [||||||||]  80ms

With this trace view, you immediately see the 150ms DB query is the bottleneck. Without tracing, you'd only know the request took 350ms total.

Error Budgets: Balancing Speed and Reliability

SLO: 99.9% availability = 43.8 minutes downtime/month
Error Budget = 100% - SLO = 0.1%

Month starts: 43.8 min budget remaining
Week 1: 5 min outage     → 38.8 min remaining  ✅
Week 2: 10 min outage    → 28.8 min remaining  ✅
Week 3: 25 min outage    → 3.8 min remaining   ⚠️ SLOW DOWN
Week 4: Budget exhausted → FREEZE deployments  ❌

When budget runs out:
  • No new feature deployments
  • Focus on reliability improvements only
  • Budget resets next month

Common Mistakes

  • ❌ Logging everything — unstructured, verbose logs are worse than no logs. Use structured JSON logging with appropriate levels.
  • ❌ Setting SLOs at 100% — perfection is impossible and destroys velocity. 99.9% is a reasonable target for most services.
  • ❌ Big-bang deployments — deploying to all servers at once. Use canary or blue-green to limit blast radius.
  • ❌ Alerting on metrics without context — CPU at 80% means nothing alone. Alert on SLI violations: "p99 latency exceeded SLO for 10 minutes."
  • ❌ Not correlating logs, metrics, and traces — use trace_id in all three pillars so you can jump from alert → trace → log.

[!TIP] Key Takeaways:
• Three pillars: logs (events), metrics (numbers), traces (request journeys). You need all three.
• Distributed tracing: Trace ID propagated through services. Visualize bottlenecks in a waterfall view.
• SLI → SLO → SLA: measure, set targets, then write contracts. Error budgets balance velocity and reliability.
• RED method: Rate, Errors, Duration — the minimum viable metrics for any service.
• Use structured JSON logging with trace_id for cross-pillar correlation.
• Canary deployments + observability = safe, incremental rollouts.
