Observability (Metrics, Logs, and Traces)
A simple guide for software engineers
Observability is the ability to understand what’s happening inside your system by looking at its outputs. It rests on three pillars: metrics, logs, and traces. Each serves a distinct purpose, and understanding when to use each one will make you significantly more effective at debugging and monitoring production systems.
Metrics
Metrics are numerical measurements collected at regular intervals: CPU usage, memory consumption, request count, error rate, response time.
They answer one question: “How is the system doing right now?”
Metrics are compact. A single data point is just a timestamp and a number, which means you can collect and store millions of them efficiently. This makes metrics ideal for dashboards, alerting, and trend analysis.
The “Four Golden Signals” from Google’s Site Reliability Engineering book are all metrics:
Latency - how long requests take
Traffic - request volume
Errors - failure rate
Saturation - resource utilization
When an alert fires, it’s almost always because a metric crossed a threshold.
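To make the threshold idea concrete, here is a minimal sketch in Python. The `MetricPoint` type, the sample numbers, and the 5% threshold are all assumptions for illustration. A real setup would rely on a metrics backend and its alerting rules rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass
class MetricPoint:
    """One metric sample: just a timestamp and a number."""
    timestamp: float  # seconds since the start of the window
    value: float


def error_rate(errors: list[MetricPoint], requests: list[MetricPoint]) -> float:
    """Fraction of requests that failed across the given samples."""
    total_requests = sum(p.value for p in requests)
    if total_requests == 0:
        return 0.0  # no traffic, so nothing failed
    return sum(p.value for p in errors) / total_requests


def should_alert(rate: float, threshold: float = 0.05) -> bool:
    """Fire when the error rate is above the (hypothetical) 5% threshold."""
    return rate > threshold


# Three one-minute samples of request and error counts (made-up numbers).
requests = [MetricPoint(0, 1000), MetricPoint(60, 1200), MetricPoint(120, 800)]
errors = [MetricPoint(0, 10), MetricPoint(60, 30), MetricPoint(120, 140)]

rate = error_rate(errors, requests)
print(f"error rate: {rate:.1%}, alert: {should_alert(rate)}")
```

Here 180 of 3,000 requests failed, so the rate is 6% and the alert fires. Notice what the code cannot tell you: which requests failed, or why.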
Limitation: Metrics tell you that something is wrong, not what is wrong. A spike in error rate doesn’t tell you which errors, which users, or which code path.
Logs
Logs are records of discrete events. A user logged in. A request was received. An error was thrown. A configuration changed.
They answer a different question: “What happened?”
Logs contain rich detail: timestamps, user IDs, request parameters, stack traces, error messages. This detail is essential for debugging. When you need to know exactly what went wrong for a specific request, you look at logs.
Limitation: Logs are verbose. A busy system generates enormous volumes of log data. Storing, indexing, and searching logs at scale is expensive. Unstructured logs make analysis harder. High log generation rates can even impact application performance.
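One common way to make logs easier to search is to emit them as structured records. Below is a rough sketch using Python’s standard `logging` and `json` modules. The logger name `checkout` and the `user_id` and `request_id` fields are hypothetical, chosen only for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        # These two field names are assumptions for this example.
        for key in ("user_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger

logger.error("payment failed", extra={"user_id": "u-123", "request_id": "r-456"})
```

Each line is now a JSON object, so a log store can filter on `user_id` or `request_id` directly instead of pattern-matching free text.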
The typical workflow: use metrics to detect a problem, then query logs to understand what happened.
Traces
Traces track a single request as it flows through a distributed system. When one user action triggers calls across multiple services, a trace shows the complete path.
They answer: “How did this request move through the system?”
A trace is a collection of spans. Each span represents a unit of work: an HTTP request, a database query, a message processed from a queue. Spans include timing information, so you can see exactly where time was spent.
Traces are essential for debugging latency in microservices. When a request is slow, a trace shows you which service or operation is the bottleneck.
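Here is a simplified sketch of a trace as a collection of spans, with a helper that finds where the time actually went. The `Span` fields and the example services are assumptions. The "self time" calculation also assumes child spans run one after another without overlapping, which real traces do not guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work within a trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: list[Span]) -> float:
    """Time spent in the span itself, excluding its direct children.

    Assumes the children run sequentially (no overlap).
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def bottleneck(spans: list[Span]) -> Span:
    """The span with the most time not explained by its children."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# A made-up trace: checkout -> payment -> inventory -> database query.
trace = [
    Span("t1", "a", None, "POST /checkout", 0, 950),
    Span("t1", "b", "a", "payment-service charge", 20, 900),
    Span("t1", "c", "b", "inventory-service reserve", 40, 880),
    Span("t1", "d", "c", "SELECT stock", 50, 860),
]

slowest = bottleneck(trace)
print(slowest.name, self_time_ms(slowest, trace))
```

Every span in this trace looks slow if you only compare total durations, because each parent includes its children. Subtracting child time shows that the database query accounts for 810 ms of the 950 ms request.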
Limitation: Traces require instrumentation. You need to propagate trace IDs through your code and across service boundaries. Tracing also generates significant data volume, particularly in high-throughput systems. Sampling is often necessary, which means you won’t capture every request.
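Propagation and sampling can be sketched in a few lines. The header name `x-trace-id` and the 10% sample rate below are hypothetical. Production systems typically use a tracing library and a standard header format rather than code like this. The point is only to show the idea: pass the ID along, and make the sampling decision a function of the ID so every service agrees.

```python
import hashlib
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name
SAMPLE_RATE = 0.10           # keep roughly 10% of traces (assumed value)


def get_or_create_trace_id(incoming_headers: dict[str, str]) -> str:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def is_sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic decision based on a hash of the trace ID.

    The same ID gives the same answer in every service, so a trace
    is either kept whole or dropped whole.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def outgoing_headers(trace_id: str) -> dict[str, str]:
    """Headers to attach to every downstream call."""
    return {TRACE_HEADER: trace_id}


# Service A receives a request with no trace ID and calls service B.
trace_id = get_or_create_trace_id({})
headers_to_b = outgoing_headers(trace_id)

# Service B sees the same ID and reaches the same sampling decision.
assert get_or_create_trace_id(headers_to_b) == trace_id
assert is_sampled(get_or_create_trace_id(headers_to_b)) == is_sampled(trace_id)
```

Deciding from the ID avoids half-recorded traces, but the trade-off from the paragraph above still applies: the slow request you care about may be one of the 90% that was never recorded.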
How They Work Together
Each pillar has a role:
Metrics detect problems. Your error rate spiked. Response times are climbing. Something is wrong.
Logs explain problems. The database connection failed. A null pointer exception occurred. The config was invalid.
Traces locate problems in distributed systems. The request was fine until it reached the payment service, which called inventory, which timed out.
In practice, you move between all three: metrics for detection and alerting, logs for detailed investigation, and traces for understanding request flow across services.
In short
Metrics are efficient and great for alerting but lack detail. Logs are detailed but expensive at scale. Traces show request flow but require instrumentation.
Use metrics to know something is wrong. Use logs to understand what went wrong. Use traces to see where it went wrong.
The three pillars complement each other. Relying on just one leaves gaps in your ability to understand and debug your systems.