[!NOTE] Building a distributed system is hard. Running a distributed system is harder. Observability is the ability to understand what is happening inside your system by looking at its external outputs: logs, metrics, and traces. Without observability, debugging a production issue in a microservices architecture is like finding a needle in a haystack — blindfolded.
The Three Pillars of Observability
| Pillar | What It Records | Best For | Tools |
|---|---|---|---|
| Logs | Discrete events with timestamps | Debugging specific errors, audit trails | ELK Stack (Elasticsearch + Logstash + Kibana), Fluentd, Loki |
| Metrics | Numerical measurements over time | Dashboards, alerting, trend analysis | Prometheus + Grafana, Datadog, CloudWatch |
| Traces | End-to-end journey of a request across services | Finding bottlenecks in distributed systems | Jaeger, Zipkin, OpenTelemetry, Honeycomb |
Distributed Tracing
When a user clicks "Buy Now," the request might flow through 10+ services: API gateway → order service → inventory → payment → notification → shipping. If the total response time is 5 seconds, which service is slow?
Distributed tracing propagates a Trace ID through every service in the request chain. Each service reports its span (start time, end time, metadata):
```
Trace ID: abc-123
[API Gateway]    |=====|            (120ms)
[Order Service]  |=========|        (200ms)
[Inventory]      |===|              (80ms)
[Payment]        |================| (450ms)  ← BOTTLENECK!
[Notification]   |==|               (50ms)
```
Total: 900ms. Payment service is the bottleneck.
How it works: The first service generates a Trace ID and passes it in HTTP headers (traceparent or X-B3-TraceId). Each downstream service creates a span, links it to the Trace ID, and continues passing it along. All spans are collected by a trace collector (Jaeger, Zipkin) and assembled into a trace view.
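A minimal sketch of that propagation step, assuming a Node.js service (18+, for the built-in fetch) and the W3C traceparent header; in practice an OpenTelemetry SDK would generate and forward this context automatically, and the downstream URL below is hypothetical:

```js
const crypto = require("crypto");

// Reuse the incoming trace ID if one exists; otherwise this service starts the trace.
function getTraceContext(incomingHeaders) {
  const incoming = incomingHeaders["traceparent"]; // format: version-traceId-parentId-flags
  const traceId = incoming
    ? incoming.split("-")[1]                        // keep the upstream trace ID
    : crypto.randomBytes(16).toString("hex");       // first service: generate 32 hex chars
  const spanId = crypto.randomBytes(8).toString("hex"); // new span ID for this hop
  return { traceId, spanId, traceparent: `00-${traceId}-${spanId}-01` };
}

// Pass the same trace ID along when calling the next service in the chain.
async function callInventory(ctx, orderId) {
  return fetch("http://inventory.internal/reserve", { // hypothetical downstream URL
    method: "POST",
    headers: { traceparent: ctx.traceparent, "content-type": "application/json" },
    body: JSON.stringify({ orderId }),
  });
}
```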
SLIs, SLOs, and SLAs
| Term | What It Is | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurable metric | p99 latency = 200ms, availability = 99.95% |
| SLO (Service Level Objective) | A target value for an SLI | "p99 latency must be under 300ms" |
| SLA (Service Level Agreement) | A contract with penalties | "99.9% uptime or we refund credits" |
Error Budgets: If your SLO is 99.9% availability, you have a 0.1% error budget — approximately 43 minutes of downtime per month. Error budgets are spent on deployments, experiments, and incidents. When the budget is exhausted, you freeze feature releases and focus on reliability.
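The arithmetic behind that number is worth making explicit; a small sketch, where the 30-day month is an assumption (an average-length month gives the 43.8 minutes used in the worked example later):

```js
// Error budget in minutes for a given SLO target over one month.
function errorBudgetMinutes(sloTarget, daysInMonth = 30) {
  const totalMinutes = daysInMonth * 24 * 60;  // 43,200 minutes in a 30-day month
  return totalMinutes * (1 - sloTarget);       // the fraction you are allowed to be down
}

errorBudgetMinutes(0.999);        // ≈ 43.2 minutes (30-day month)
errorBudgetMinutes(0.999, 30.44); // ≈ 43.8 minutes (average month length)
```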
The RED Method
A simple framework for monitoring request-driven services:
- Rate: How many requests per second?
- Errors: How many of those requests are failing?
- Duration: How long do the requests take? (p50, p95, p99)
If you can only have three metrics per service, use RED. It covers the essential health indicators.
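As a concrete sketch, here is what RED instrumentation might look like with the prom-client Node library; the metric names and histogram buckets are illustrative choices, not a fixed standard:

```js
const client = require("prom-client"); // assumes prom-client is installed

// Rate and Errors both come from one counter labeled by status code;
// Duration is a histogram so Prometheus can compute p50/p95/p99 later.
const requests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["route", "status"],
});
const duration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["route"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Call this once per request, e.g. from HTTP middleware.
function recordRequest(route, statusCode, seconds) {
  requests.inc({ route, status: String(statusCode) });
  duration.observe({ route }, seconds);
}
```

On the Prometheus side, Rate is `rate(http_requests_total[5m])`, Errors is the same query filtered to `status=~"5.."`, and Duration (p99) is `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`.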
Deployment Strategies
Observability is critical for safe deployments. These strategies progressively roll out changes while monitoring for regressions:
| Strategy | How | Rollback Speed | Risk |
|---|---|---|---|
| Rolling Update | Replace instances one at a time | Medium | Low — partial rollout |
| Blue-Green | Run two identical environments; switch traffic from Blue (old) to Green (new) at once | Fast (switch back) | Medium — full traffic on new |
| Canary | Route 1–5% of traffic to new version; monitor; gradually increase | Fast (route back to old) | Low — only a fraction of users affected |
| Feature Flags | Deploy the code but toggle features on/off at runtime | Instant (toggle off) | Very low |
[!TIP] Canary + observability is the gold standard. Deploy to 1% of traffic, watch your RED metrics and error budget for 15 minutes, then ramp to 10%, 50%, 100%. If errors spike at any stage, roll back instantly. Netflix, Google, and Amazon all use this approach.
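For illustration, a minimal sketch of the traffic-split decision itself, assuming you can hash a stable user ID; in practice a load balancer, service mesh, or feature-flag system usually handles this:

```js
const crypto = require("crypto");

// Deterministic canary bucketing: the same user always lands in the same bucket,
// so their experience doesn't flip between versions on every request.
function routeToCanary(userId, canaryPercent) {
  const hash = crypto.createHash("sha256").update(String(userId)).digest();
  const bucket = hash.readUInt16BE(0) % 100; // stable bucket 0-99 per user
  return bucket < canaryPercent;             // true => send to the new version
}

// Ramp plan from the tip above: 1% -> 10% -> 50% -> 100%, gated on RED metrics.
routeToCanary("user-123", 1); // true for roughly 1% of users
```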
Real-World Usage
Google SRE
Google's Site Reliability Engineering team pioneered many observability concepts. They maintain error budgets for every service. When a team exhausts their error budget, they halt feature development and focus entirely on reliability. This creates a natural balance between velocity and stability.
Jaeger (Uber)
Uber created Jaeger, an open-source distributed tracing system. With 4,000+ microservices, Uber needed tracing to debug latency issues. Jaeger propagates trace context through every service and provides a UI to visualize request flows and identify slow services.
Structured Logging Best Practices
```js
// BAD: Unstructured log
console.log("User 123 placed order 456 for $50.00");

// GOOD: Structured JSON log (logger is any structured JSON logger, e.g. pino or winston)
logger.info({
  event: "order.placed",
  user_id: "123",
  order_id: "456",
  amount: 50.00,
  currency: "USD",
  trace_id: "abc-def-123", // links this log line to the distributed trace
  span_id: "span-789"
});

// Now you can query: "Show all orders > $100 that failed"
```
Distributed Tracing: Waterfall View
```
Trace ID: abc-def-123
API Gateway        [||||||||||||||||||||||||||||]  350ms total
  Auth Service     [||||]                           15ms
  Order Service    [||||||||||||||||]              200ms  <-- BOTTLENECK
    DB Query       [||||||||||]                    150ms  <-- Slow query!
    Cache Write    [||]                              5ms
  Payment Service  [||||||||]                       80ms
```
With this trace view, you immediately see that the 150ms DB query inside the order service is the bottleneck. Without tracing, you'd only know the request took 350ms total.
Error Budgets: Balancing Speed and Reliability
```
SLO: 99.9% availability = 43.8 minutes downtime/month
Error budget = 100% - 99.9% = 0.1%

Month starts: 43.8 min budget remaining
Week 1:  5 min outage → 38.8 min remaining ✅
Week 2: 10 min outage → 28.8 min remaining ✅
Week 3: 25 min outage →  3.8 min remaining ⚠️ SLOW DOWN
Week 4: budget exhausted → FREEZE deployments ❌
```
When the budget runs out:
- No new feature deployments
- Focus on reliability improvements only
- The budget resets next month
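The freeze policy is simple enough to encode; a self-contained sketch, where the 99.9% target and the average month length are the example's assumptions:

```js
// Given the outage minutes so far this month, report the remaining error budget
// and whether deployments should be frozen.
function errorBudgetStatus(outageMinutes, sloTarget = 0.999, daysInMonth = 30.44) {
  const budget = daysInMonth * 24 * 60 * (1 - sloTarget); // ≈ 43.8 min at 99.9%
  const remaining = budget - outageMinutes;
  return { remaining, freezeDeployments: remaining <= 0 };
}

errorBudgetStatus(5 + 10 + 25); // remaining ≈ 3.8 min -> keep deploying, carefully
errorBudgetStatus(45);          // remaining < 0       -> freeze deployments
```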
Common Mistakes
- ❌ Logging everything — unstructured, verbose logs are worse than no logs. Use structured JSON logging with appropriate levels.
- ❌ Setting SLOs at 100% — perfection is impossible and destroys velocity. 99.9% is a reasonable target for most services.
- ❌ Big-bang deployments — deploying to all servers at once. Use canary or blue-green to limit blast radius.
- ❌ Alerting on metrics without context — CPU at 80% means nothing alone. Alert on SLI violations: "p99 latency exceeded SLO for 10 minutes."
- ❌ Not correlating logs, metrics, and traces — use trace_id in all three pillars so you can jump from alert → trace → log.
[!TIP] Key Takeaways:
- Three pillars: logs (events), metrics (numbers), traces (request journeys). You need all three.
- Distributed tracing: Trace ID propagated through services. Visualize bottlenecks in a waterfall view.
- SLI → SLO → SLA: measure, set targets, then write contracts. Error budgets balance velocity and reliability.
- RED method: Rate, Errors, Duration — the minimum viable metrics for any service.
- Use structured JSON logging with trace_id for cross-pillar correlation.
• Canary deployments + observability = safe, incremental rollouts.