[!NOTE] Building a distributed system is hard. Running a distributed system is harder. Observability is the ability to understand what is happening inside your system by looking at its external outputs: logs, metrics, and traces. Without observability, debugging a production issue in a microservices architecture is like finding a needle in a haystack — blindfolded.
The Three Pillars of Observability
| Pillar | What It Records | Best For | Tools |
|---|---|---|---|
| Logs | Discrete events with timestamps | Debugging specific errors, audit trails | ELK Stack (Elasticsearch + Logstash + Kibana), Fluentd, Loki |
| Metrics | Numerical measurements over time | Dashboards, alerting, trend analysis | Prometheus + Grafana, Datadog, CloudWatch |
| Traces | End-to-end journey of a request across services | Finding bottlenecks in distributed systems | Jaeger, Zipkin, OpenTelemetry, Honeycomb |
Distributed Tracing
When a user clicks "Buy Now," the request might flow through 10+ services: API gateway → order service → inventory → payment → notification → shipping. If the total response time is 5 seconds, which service is slow?
Distributed tracing propagates a Trace ID through every service in the request chain. Each service reports its span (start time, end time, metadata):
```
Trace ID: abc-123
[API Gateway]    |=====|            (120ms)
[Order Service]  |=========|        (200ms)
[Inventory]      |===|              (80ms)
[Payment]        |================| (450ms)  ← BOTTLENECK!
[Notification]   |==|               (50ms)
```
Total: 900ms. Payment service is the bottleneck.
How it works: The first service generates a Trace ID and passes it in HTTP headers (traceparent or X-B3-TraceId). Each downstream service creates a span, links it to the Trace ID, and continues passing it along. All spans are collected by a trace collector (Jaeger, Zipkin) and assembled into a trace view.
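A minimal sketch of that propagation step, assuming a Node.js service (18+, for the built-in fetch) and the W3C traceparent header; in practice an OpenTelemetry SDK would generate and forward this context automatically, and the downstream URL below is hypothetical:

```js
const crypto = require("crypto");

// Reuse the incoming trace ID if one exists; otherwise this service starts the trace.
function getTraceContext(incomingHeaders) {
  const incoming = incomingHeaders["traceparent"]; // format: version-traceId-parentId-flags
  const traceId = incoming
    ? incoming.split("-")[1]                        // keep the upstream trace ID
    : crypto.randomBytes(16).toString("hex");       // first service: generate 32 hex chars
  const spanId = crypto.randomBytes(8).toString("hex"); // new span ID for this hop
  return { traceId, spanId, traceparent: `00-${traceId}-${spanId}-01` };
}

// Pass the same trace ID along when calling the next service in the chain.
async function callInventory(ctx, orderId) {
  return fetch("http://inventory.internal/reserve", { // hypothetical downstream URL
    method: "POST",
    headers: { traceparent: ctx.traceparent, "content-type": "application/json" },
    body: JSON.stringify({ orderId }),
  });
}
```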
SLIs, SLOs, and SLAs
| Term | What It Is | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurable metric | p99 latency = 200ms, availability = 99.95% |
| SLO (Service Level Objective) | A target value for an SLI | "p99 latency must be under 300ms" |
| SLA (Service Level Agreement) | A contract with penalties | "99.9% uptime or we refund credits" |
Error Budgets: If your SLO is 99.9% availability, you have a 0.1% error budget — approximately 43 minutes of downtime per month. Error budgets are spent on deployments, experiments, and incidents. When the budget is exhausted, you freeze feature releases and focus on reliability.
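The arithmetic behind that number is worth making explicit; a small sketch, where the 30-day month is an assumption (an average-length month gives the 43.8 minutes used in the worked example later):

```js
// Error budget in minutes for a given SLO target over one month.
function errorBudgetMinutes(sloTarget, daysInMonth = 30) {
  const totalMinutes = daysInMonth * 24 * 60;  // 43,200 minutes in a 30-day month
  return totalMinutes * (1 - sloTarget);       // the fraction you are allowed to be down
}

errorBudgetMinutes(0.999);        // ≈ 43.2 minutes (30-day month)
errorBudgetMinutes(0.999, 30.44); // ≈ 43.8 minutes (average month length)
```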
The RED Method
A simple framework for monitoring request-driven services:
- Rate: How many requests per second?
- Errors: How many of those requests are failing?
- Duration: How long do the requests take? (p50, p95, p99)
If you can only have three metrics per service, use RED. It covers the essential health indicators.
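As a concrete sketch, here is what RED instrumentation might look like with the prom-client Node library; the metric names and histogram buckets are illustrative choices, not a fixed standard:

```js
const client = require("prom-client"); // assumes prom-client is installed

// Rate and Errors both come from one counter labeled by status code;
// Duration is a histogram so Prometheus can compute p50/p95/p99 later.
const requests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["route", "status"],
});
const duration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["route"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Call this once per request, e.g. from HTTP middleware.
function recordRequest(route, statusCode, seconds) {
  requests.inc({ route, status: String(statusCode) });
  duration.observe({ route }, seconds);
}
```

On the Prometheus side, Rate is `rate(http_requests_total[5m])`, Errors is the same query filtered to `status=~"5.."`, and Duration (p99) is `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`.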
Deployment Strategies
Observability is critical for safe deployments. These strategies progressively roll out changes while monitoring for regressions:
| Strategy | How | Rollback Speed | Risk |
|---|---|---|---|
| Rolling Update | Replace instances one at a time | Medium | Low — partial rollout |
| Blue-Green | Run two identical environments; switch traffic from Blue (old) to Green (new) at once | Fast (switch back) | Medium — full traffic on new |
| Canary | Route 1–5% of traffic to new version; monitor; gradually increase | Fast (route back to old) | Low — only a fraction of users affected |
| Feature Flags | Deploy the code but toggle features on/off at runtime | Instant (toggle off) | Very low |
[!TIP] Canary + observability is the gold standard. Deploy to 1% of traffic, watch your RED metrics and error budget for 15 minutes, then ramp to 10%, 50%, 100%. If errors spike at any stage, roll back instantly. Netflix, Google, and Amazon all use this approach.
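For illustration, a minimal sketch of the traffic-split decision itself, assuming you can hash a stable user ID; in practice a load balancer, service mesh, or feature-flag system usually handles this:

```js
const crypto = require("crypto");

// Deterministic canary bucketing: the same user always lands in the same bucket,
// so their experience doesn't flip between versions on every request.
function routeToCanary(userId, canaryPercent) {
  const hash = crypto.createHash("sha256").update(String(userId)).digest();
  const bucket = hash.readUInt16BE(0) % 100; // stable bucket 0-99 per user
  return bucket < canaryPercent;             // true => send to the new version
}

// Ramp plan from the tip above: 1% -> 10% -> 50% -> 100%, gated on RED metrics.
routeToCanary("user-123", 1); // true for roughly 1% of users
```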
Real-World Usage
Google SRE
Google's Site Reliability Engineering team pioneered many observability concepts. They maintain error budgets for every service. When a team exhausts their error budget, they halt feature development and focus entirely on reliability. This creates a natural balance between velocity and stability.
Jaeger (Uber)
Uber created Jaeger, an open-source distributed tracing system. With 4,000+ microservices, Uber needed tracing to debug latency issues. Jaeger propagates trace context through every service and provides a UI to visualize request flows and identify slow services.
Structured Logging Best Practices
```js
// BAD: Unstructured log
console.log("User 123 placed order 456 for $50.00");

// GOOD: Structured JSON log (logger is any structured JSON logger, e.g. pino or winston)
logger.info({
  event: "order.placed",
  user_id: "123",
  order_id: "456",
  amount: 50.00,
  currency: "USD",
  trace_id: "abc-def-123", // links this log line to the distributed trace
  span_id: "span-789"
});

// Now you can query: "Show all orders > $100 that failed"
```
Distributed Tracing: Waterfall View
```
Trace ID: abc-def-123
API Gateway        [||||||||||||||||||||||||||||]  350ms total
  Auth Service     [||||]                           15ms
  Order Service    [||||||||||||||||]              200ms  <-- BOTTLENECK
    DB Query       [||||||||||]                    150ms  <-- Slow query!
    Cache Write    [||]                              5ms
  Payment Service  [||||||||]                       80ms
```
With this trace view, you immediately see that the 150ms DB query inside the order service is the bottleneck. Without tracing, you'd only know the request took 350ms total.
Error Budgets: Balancing Speed and Reliability
```
SLO: 99.9% availability = 43.8 minutes downtime/month
Error budget = 100% - 99.9% = 0.1%

Month starts: 43.8 min budget remaining
Week 1:  5 min outage → 38.8 min remaining ✅
Week 2: 10 min outage → 28.8 min remaining ✅
Week 3: 25 min outage →  3.8 min remaining ⚠️ SLOW DOWN
Week 4: budget exhausted → FREEZE deployments ❌
```
When the budget runs out:
- No new feature deployments
- Focus on reliability improvements only
- The budget resets next month
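The freeze policy is simple enough to encode; a self-contained sketch, where the 99.9% target and the average month length are the example's assumptions:

```js
// Given the outage minutes so far this month, report the remaining error budget
// and whether deployments should be frozen.
function errorBudgetStatus(outageMinutes, sloTarget = 0.999, daysInMonth = 30.44) {
  const budget = daysInMonth * 24 * 60 * (1 - sloTarget); // ≈ 43.8 min at 99.9%
  const remaining = budget - outageMinutes;
  return { remaining, freezeDeployments: remaining <= 0 };
}

errorBudgetStatus(5 + 10 + 25); // remaining ≈ 3.8 min -> keep deploying, carefully
errorBudgetStatus(45);          // remaining < 0       -> freeze deployments
```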
Common Mistakes
- ❌ Logging everything — unstructured, verbose logs are worse than no logs. Use structured JSON logging with appropriate levels.
- ❌ Setting SLOs at 100% — perfection is impossible and destroys velocity. 99.9% is a reasonable target for most services.
- ❌ Big-bang deployments — deploying to all servers at once. Use canary or blue-green to limit blast radius.
- ❌ Alerting on metrics without context — CPU at 80% means nothing alone. Alert on SLI violations: "p99 latency exceeded SLO for 10 minutes."
- ❌ Not correlating logs, metrics, and traces — use trace_id in all three pillars so you can jump from alert → trace → log.
[!TIP] Key Takeaways:
- Three pillars: logs (events), metrics (numbers), traces (request journeys). You need all three.
- Distributed tracing: Trace ID propagated through services. Visualize bottlenecks in a waterfall view.
- SLI → SLO → SLA: measure, set targets, then write contracts. Error budgets balance velocity and reliability.
- RED method: Rate, Errors, Duration — the minimum viable metrics for any service.
- Use structured JSON logging with trace_id for cross-pillar correlation.
• Canary deployments + observability = safe, incremental rollouts.