I remember the day our production system went down at 2 AM. Our CEO asked, “What happened?” and I had three options: look at dashboards that showed nothing useful, dig through terabytes of logs with grep, or pray. Spoiler alert: I prayed. And that’s when I realized we’d been doing observability all wrong.

Fast forward to today, and observability has become the holy grail of modern engineering. But here’s the dirty secret vendors won’t tell you: you don’t need a six-figure annual contract with a SaaS platform to have decent observability. You need understanding, strategy, and a little elbow grease.

This article is my personal journey through building pragmatic observability on a shoestring budget. Whether you’re a startup burning through venture capital or an established company tired of SaaS subscription fatigue, this guide will show you how to implement the three pillars of observability—metrics, logs, and traces—without selling your firstborn.

Why Observability Matters (And Why Traditional Monitoring Fails)

Before we get into the nitty-gritty, let’s establish why observability is different from traditional monitoring. Traditional monitoring is like having a thermostat: it tells you the current temperature, nothing more. Observability is like having a doctor who can ask you questions and diagnose the problem.

Metrics are the vital signs of your application. CPU usage, request rates, error counts—these are numerical measurements aggregated over time. They’re cheap to store and perfect for dashboards and alerting.

Logs are timestamped, immutable records of what actually happened. They provide granular context about a specific error or operation, telling you exactly what went wrong at a particular moment.

Traces represent the complete, end-to-end journey of a single request as it propagates through your microservices. They answer the crucial question of where a problem occurred in your complex workflow.

The magic happens when you unite these three signals. An SRE can start with a high-level metric alert, drill down into the specific error message in a log, and see exactly which downstream service call failed. That’s observability.

The Budget Reality Check

Let’s be honest: enterprise observability platforms are expensive. We’re talking $10,000-$100,000+ annually depending on data volume. For many organizations, especially startups, this is a luxury they can’t afford. The good news? The open-source ecosystem has matured dramatically. You can build a production-grade observability stack using:

  • Prometheus for metrics (free, battle-tested)
  • Loki or ELK Stack for logs (free/cheap)
  • Jaeger or Zipkin for distributed tracing (free)
  • OpenTelemetry as your instrumentation standard (free)

Your only real cost is infrastructure (which you’d have anyway) and operational effort (which teaches you more than outsourcing ever would).

Understanding Metrics: The Foundation

Metrics are your high-level overview. They’re aggregated numbers with tags attached—think “HTTP requests per second” or “database connection pool usage.”

Why Metrics Matter

Metrics excel at:

  • Detecting baseline anomalies (“our error rate just jumped 10%”)
  • Building dashboards for executives (“here’s our uptime this quarter”)
  • Setting up alerts (“page me when this threshold is crossed”)
  • Storing efficiently (a metric takes kilobytes, not gigabytes)

Practical Implementation with Prometheus

Prometheus is the de facto standard for metrics collection. Here’s a minimal setup:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

On your application side, expose metrics using a client library for your language. Here’s a Go example using the official Prometheus client:

package main
import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
    requestCounter = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)
func init() {
    prometheus.MustRegister(requestCounter)
    prometheus.MustRegister(requestDuration)
}
func main() {
    http.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        // Your business logic here
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(`{"status": "ok"}`))
        // Record metrics
        requestCounter.WithLabelValues(
            r.Method,
            "/api/users",
            "200",
        ).Inc()
        requestDuration.WithLabelValues(
            r.Method,
            "/api/users",
        ).Observe(time.Since(start).Seconds())
    })
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Key Metrics to Track

Every application should expose these baseline metrics:

  • Request rate: requests per second (broken down by endpoint and method)
  • Error rate: failed requests as a percentage
  • Latency: p50, p95, p99 response times
  • Resource usage: CPU, memory, disk I/O
  • Business metrics: user signups, transactions completed, revenue
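
That last bullet deserves a concrete example. Here is a minimal sketch with the Python prometheus_client library; the metric names and helper functions are my own, purely illustrative:

from prometheus_client import Counter

# Hypothetical business metrics (names are illustrative, not part of the stack above)
user_signups_total = Counter(
    "user_signups_total",
    "Total number of completed user signups",
)
checkout_revenue_usd_total = Counter(
    "checkout_revenue_usd_total",
    "Cumulative revenue from completed checkouts, in USD",
)

def on_signup_completed():
    # Call this from your signup handler
    user_signups_total.inc()

def on_checkout_completed(amount_usd: float):
    # Call this from your checkout handler
    checkout_revenue_usd_total.inc(amount_usd)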

Querying Metrics with PromQL

Prometheus uses PromQL for querying. Some essential queries:

# Request rate over the last 5 minutes
rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Memory usage in MB
node_memory_MemAvailable_bytes / 1024 / 1024

Understanding Logs: The Detailed Story

While metrics give you the headline, logs give you the full investigative report. They’re textual records that capture what happened, why it happened, and the complete context.

The Log Hierarchy

Log levels exist for a reason:

  • DEBUG: Development information, verbose output
  • INFO: Normal operations, key events
  • WARNING: Something unexpected happened, but we recovered
  • ERROR: Something failed, manual intervention may be needed
  • CRITICAL: System is in a bad state, action required immediately
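
In practice you pick the threshold per environment rather than hard-coding it. A small Python sketch, assuming a LOG_LEVEL environment variable (the variable name is my convention, not a standard):

import logging
import os

# DEBUG locally, INFO or higher in production; LOG_LEVEL is an assumed convention
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logging.getLogger(__name__).debug("only emitted when LOG_LEVEL=DEBUG")
logging.getLogger(__name__).info("normal operations, key events")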

Structured Logging: The Game Changer

The difference between good and bad logging is structure. Compare these:

Bad: "Error processing request"
Good: {"timestamp": "2026-02-14T14:30:45Z", "level": "error", "service": "payment-service", "user_id": "usr_123", "error": "insufficient_funds", "amount": 99.99, "request_id": "req_456"}

Structured logging (JSON format) lets you search, filter, and correlate logs effortlessly. Here’s a Python example:

import logging
from pythonjsonlogger import jsonlogger
# Setup JSON logging
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger = logging.getLogger()
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)
# Usage
logger.info("payment_processed", extra={
    "user_id": "usr_123",
    "amount": 99.99,
    "currency": "USD",
    "transaction_id": "txn_789",
    "processing_time_ms": 245
})
logger.error("payment_failed", extra={
    "user_id": "usr_456",
    "amount": 49.99,
    "error_code": "INSUFFICIENT_FUNDS",
    "retry_count": 3,
    "request_id": "req_999"
})

Budget-Friendly Log Aggregation with Loki

Grafana Loki is specifically designed for the budget-conscious. Unlike ELK (which indexes everything), Loki only indexes labels, keeping storage costs low.

# loki-config.yml
auth_enabled: false
ingester:
  chunk_idle_period: 3m
  max_chunk_age: 1h
  max_streams_limit: 34722
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
server:
  http_listen_port: 3100
storage_config:
  filesystem:
    directory: /loki/chunks
chunk_store_config:
  max_look_back_period: 0s

And a minimal Promtail config to ship logs to Loki:

# promtail-config.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'

Real-World Log Analysis

Here’s how you’d typically search logs in a budget stack:

# Using LogCLI (Loki's command-line interface)
# Find all errors in the payment service
logcli query '{job="payment-service", level="error"}'
# Find logs mentioning a specific user within a time range
logcli query '{job="payment-service"} |= "usr_123"' --from="2026-02-14T10:00:00Z" --to="2026-02-14T14:00:00Z"
# Count errors by error_code (raw output prints just the JSON log lines, ready for jq)
logcli query '{job="payment-service", level="error"}' --output=raw | jq -r '.error_code' | sort | uniq -c

Understanding Traces: Following the Breadcrumbs

Here’s where it gets interesting. In a microservices world, a single user request might touch 10-15 different services. Without traces, finding which one failed is like debugging with your eyes closed. A trace represents the complete journey of a single request through your system. It consists of spans, each representing a specific operation or component.

The Anatomy of a Trace

Let’s use a real example: a user completes a purchase on an e-commerce platform.

Trace ID: trace_2026021414300001
├─ Span: API Gateway (0ms → 450ms)
│  └─ Tags: user_id=usr_123, endpoint=/checkout
├─ Span: Auth Service (20ms → 80ms)
│  └─ Tags: user_id=usr_123, status=authorized
├─ Span: Inventory Service (90ms → 200ms)
│  └─ Tags: product_id=prod_456, reserved=true
├─ Span: Payment Service (210ms → 380ms)
│  └─ Tags: amount=99.99, processor=stripe
│  └─ Log: "Charge successful" (350ms)
└─ Span: Order Service (390ms → 450ms)
   └─ Tags: order_id=ord_789, status=created

Each span can have metadata attached—tags like user_id, status_code, or even the exact database query that ran.
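
To make that concrete, here is a short sketch using the OpenTelemetry Python API (covered properly later in this article). The values are illustrative; db.system and db.statement are standard semantic-convention attribute keys:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def fetch_user(user_id: str):
    # Each span carries searchable metadata about the operation it wraps
    with tracer.start_as_current_span("db_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users WHERE id = $1")
        span.set_attribute("user.id", user_id)
        # ... execute the query and return the row here ...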

Implementing Distributed Tracing with Jaeger

Jaeger is an open-source distributed tracing platform born at Uber. Here’s a minimal Node.js example using the jaeger-client and opentracing packages (both now in maintenance mode in favor of OpenTelemetry, but still the quickest way to see tracing in action):

const express = require('express');
const jaeger = require('jaeger-client');
const opentracing = require('opentracing');
const initTracer = (serviceName) => {
    const sampler = {
        type: 'const',
        param: 1,
    };
    const options = {
        serviceName,
        sampler,
        reporter: {
            logSpans: true,
            agentHost: 'localhost',
            agentPort: 6831,
        },
    };
    return jaeger.initTracer(options);
};
const tracer = initTracer('payment-service');
opentracing.initGlobalTracer(tracer);
const app = express();
app.use(express.json());
// Middleware to extract trace context from requests
app.use((req, res, next) => {
    const wireCtx = tracer.extract(opentracing.FORMAT_HTTP_HEADERS, req.headers);
    const span = tracer.startSpan(req.path, { childOf: wireCtx });
    span.setTag(opentracing.Tags.SPAN_KIND, opentracing.Tags.SPAN_KIND_RPC_SERVER);
    span.setTag(opentracing.Tags.HTTP_METHOD, req.method);
    span.setTag(opentracing.Tags.HTTP_URL, req.url);
    span.setTag('user.id', req.user?.id);
    req.span = span; // expose the request span to downstream handlers
    res.on('finish', () => {
        span.setTag(opentracing.Tags.HTTP_STATUS_CODE, res.statusCode);
        span.finish();
    });
    next();
});
// Your endpoint
app.post('/api/payment', async (req, res) => {
    const parentSpan = tracer.startSpan('payment_processing', { childOf: req.span });
    try {
        // Span 1: Validate payment
        const validationSpan = tracer.startSpan('validate_payment', { childOf: parentSpan });
        const isValid = await validatePayment(req.body);
        validationSpan.log({ event: 'validation_complete', result: isValid });
        validationSpan.finish();
        // Span 2: Process with payment gateway
        const gatewaySpan = tracer.startSpan('gateway_request', { childOf: parentSpan });
        gatewaySpan.setTag('gateway', 'stripe');
        const result = await stripe.charges.create({
            amount: req.body.amount,
            currency: 'usd',
            source: req.body.token,
        });
        gatewaySpan.log({ event: 'charge_created', charge_id: result.id });
        gatewaySpan.finish();
        // Span 3: Create order
        const orderSpan = tracer.startSpan('create_order', { childOf: parentSpan });
        const order = await db.orders.create({
            user_id: req.user.id,
            charge_id: result.id,
            amount: req.body.amount,
        });
        orderSpan.log({ event: 'order_created', order_id: order.id });
        orderSpan.finish();
        parentSpan.finish();
        res.json({ success: true, order_id: order.id });
    } catch (error) {
        parentSpan.setTag(opentracing.Tags.ERROR, true);
        parentSpan.log({ event: 'error', message: error.message, stack: error.stack });
        parentSpan.finish();
        res.status(500).json({ error: error.message });
    }
});

The Trace-Log Connection

Here’s the beautiful part: you can correlate logs and traces using a shared identifier. Every log should include the trace_id:

// Enhanced logging with trace correlation (re-using the request span set by the middleware)
const span = req.span;
// jaeger-client also exposes this as span.context().traceIdStr, depending on client version
const traceId = span ? span.context().toTraceId() : undefined;
logger.info('payment_initiated', {
    trace_id: traceId,
    user_id: req.user.id,
    amount: req.body.amount,
    timestamp: new Date().toISOString()
});
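
If part of your stack is Python, you can wire the same correlation up once with a logging filter instead of passing trace_id by hand. A sketch assuming the OpenTelemetry Python API; the filter class and field name are my own:

import logging
from opentelemetry import trace
from pythonjsonlogger import jsonlogger

class TraceIdFilter(logging.Filter):
    """Attach the current trace_id (if any) to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter())
handler.addFilter(TraceIdFilter())  # every record this handler emits now carries trace_id
logging.getLogger().addHandler(handler)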

The Budget Observability Stack: Putting It Together

Let me show you how these three pillars work together in practice. Here’s a visual representation:

graph TB
  A[User Request] -->|triggers| B[API Gateway]
  B -->|metric: request_count| M[(Prometheus)]
  B -->|creates span| J[(Jaeger)]
  B -->|logs event| L[(Loki)]
  B -->|calls| C[Payment Service]
  C -->|metric: latency| M
  C -->|span: payment_processing| J
  C -->|logs with trace_id| L
  C -->|calls| D[Database]
  D -->|metric: query_time| M
  D -->|span: db_query| J
  D -->|logs slow query| L
  C -->|calls| E[Payment Gateway]
  E -->|metric: gateway_latency| M
  E -->|span: external_api| J
  E -->|logs response| L
  M -->|queries| G[Grafana]
  J -->|queries| G
  L -->|queries| G
  G -->|displays| H[Unified Dashboard]
  style M fill:#ff9900
  style J fill:#1f77b4
  style L fill:#2ca02c
  style H fill:#d62728

Architecture Overview

Here’s how I’d structure a budget observability platform:

Infrastructure:

  • Docker containers for Prometheus, Loki, and Jaeger
  • A central Grafana instance for unified dashboards
  • Minimal resource footprint (you can run all of this on a single 4GB machine initially)

Data Flow:

  1. Applications expose metrics, which Prometheus scrapes
  2. Applications write structured logs, which Promtail ships to Loki
  3. Applications emit traces to Jaeger
  4. Grafana queries all three data sources and presents unified insights

Sample Docker Compose Setup

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    ports:
      - "9090:9090"
    networks:
      - observability
  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    ports:
      - "3100:3100"
    networks:
      - observability
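  # Assumed addition: the Promtail agent described above, shipping host and
  # container logs into Loki (volume paths are illustrative)
  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/run/docker.sock:/var/run/docker.sock
    command: -config.file=/etc/promtail/config.yml
    networks:
      - observability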
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      COLLECTOR_ZIPKIN_HOST_PORT: ":9411"
    ports:
      - "6831:6831/udp"
      - "16686:16686"
    networks:
      - observability
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - loki
      - jaeger
    networks:
      - observability
volumes:
  prometheus_data:
  loki_data:
  grafana_data:
networks:
  observability:
    driver: bridge

The Practical Workflow: From Alert to Root Cause

Let me walk you through a real scenario. Your error rate just spiked from 0.5% to 8% at 2:30 PM.

Step 1: Alert Detection (Metrics)

Your Prometheus alert fires:

- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 2m
  annotations:
    summary: "High error rate detected: {{ $value | humanizePercentage }}"

You get paged. Your adrenaline spikes.

Step 2: Dashboard Inspection

You jump into Grafana and see:

  • HTTP request latency is normal (p95 = 250ms)
  • Error rate is specifically in the payment service (other services are fine)
  • The spike correlates with a deployment 10 minutes ago

Step 3: Log Investigation (Logs)

You query Loki for payment service errors from the last 15 minutes:

{job="payment-service", level="error"} | json

Results show 847 errors with error_code: "DB_CONNECTION_TIMEOUT". Interesting. The new deployment probably introduced a database connection leak.

Step 4: Trace Analysis

You grab a specific trace ID from the logs and jump into Jaeger. The trace shows:

  • API Gateway: 50ms (normal)
  • Auth Service: 30ms (normal)
  • Payment Service: 12,000ms (OH NO)
    • Validate Payment: 100ms
    • Get DB Connection: 11,500ms ← HERE’S THE PROBLEM
    • Gateway Request: skipped (timed out before getting here)

The new code is trying to acquire a database connection and timing out after 11.5 seconds. The connection pool is exhausted.

Step 5: Root Cause

You check the deployment diff and find that someone increased the connection pool’s max idle time from 2 minutes to 20 minutes. Combined with a traffic spike, the pool is now holding 500 connections instead of 50, and the database server can’t sustain that many open connections.

Timeline:

  • 14:32 - Bug detected via metrics
  • 14:33 - Service identified via metrics
  • 14:34 - Error cause identified via logs
  • 14:35 - Exact bottleneck pinpointed via traces
  • 14:36 - Root cause identified and fix deployed

Without observability, this would have taken 2 hours and 5 frantic phone calls. With it, 4 minutes.
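
The fix itself usually comes down to a couple of pool settings. Purely as an illustration (SQLAlchemy here, not necessarily the stack from the story above; the DSN and numbers are placeholders):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app:secret@db:5432/shop",  # placeholder DSN
    pool_size=20,       # steady-state connections
    max_overflow=10,    # allow short bursts, not a 500-connection pile-up
    pool_timeout=5,     # fail fast instead of hanging for 11.5 seconds
    pool_recycle=120,   # recycle idle connections after 2 minutes, not 20
)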

OpenTelemetry: The Universal Standard

I mentioned OpenTelemetry earlier, and now’s the time to dive deeper. OpenTelemetry (OTEL) is a unified instrumentation standard supported by all major cloud providers, and it’s the future of observability. Instead of pulling in the Prometheus client, the Jaeger client, and a custom logging setup separately, you instrument once with OpenTelemetry.

import time

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
# Setup tracing
jaeger_exporter = JaegerExporter(
    agent_host_name='localhost',
    agent_port=6831,
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
# Setup metrics
prometheus_reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[prometheus_reader]))
start_http_server(port=8000)
# Get instrumentation
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
# Create metrics
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)
request_histogram = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request latency"
)
# Use in your code
def process_payment(user_id, amount):
    with tracer.start_as_current_span("payment_processing") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("amount", amount)
        start = time.time()
        try:
            # Your payment logic
            result = charge_card(user_id, amount)
            request_counter.add(1, {"status": "success"})
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_attribute("error", True)
            request_counter.add(1, {"status": "error"})
            raise
        finally:
            duration = time.time() - start
            request_histogram.record(duration)

The beauty of OTEL is that you instrument once and can send data to any backend. Want to switch from Jaeger to Datadog? Just change the exporter. This vendor lock-in protection is worth its weight in gold.
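
As a concrete sketch of that swap, here is the tracing setup from the example above pointed at an OTLP endpoint instead of Jaeger. This assumes the opentelemetry-exporter-otlp package and a collector at a hypothetical address:

# Hypothetical swap: same tracer provider, different exporter.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))
# Everything that calls trace.get_tracer(__name__) keeps working unchanged.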

Cost Breakdown: Budget Edition vs. Enterprise

Let me give you some real numbers:

| Component               | Budget                                 | Enterprise       |
|-------------------------|----------------------------------------|------------------|
| Metrics Storage         | $50/month (cloud) or $0 (self-hosted)  | $500-2000/month  |
| Log Storage             | $100/month (Loki) or $0 (self-hosted)  | $1000-5000/month |
| Trace Storage           | $0 (Jaeger self-hosted)                | $500-2000/month  |
| Dashboard/Visualization | $0 (Grafana open-source)               | $200-1000/month  |
| Total Monthly           | $150-200                               | $2200-9000       |

Even if you account for the operational overhead of running your own stack (let’s say you spend 10 hours per month managing it), you’re looking at $150 + (10 × $50 per hour) = $650 per month vs. $2200-9000 for a SaaS solution. And honestly? Running your own observability stack teaches you more about your systems than any SaaS platform ever could.

Practical Tips from the Trenches

1. Start with metrics, add logs, then traces. Don’t try to implement full distributed tracing on day one. Begin with metrics to establish baselines, add structured logging to get context, then layer in traces when you need them. You’ll actually use each layer when you get there.

2. Label strategically. Every metric and log needs labels/tags, but too many unique label values will blow out your storage. Use labels like service, endpoint, environment, user_type. Avoid user_id or request_id as labels—put those in logs or trace spans instead.

3. Set retention policies early. Raw metrics can be kept for 30 days, logs for 7-14 days, traces for 24-48 hours. This isn’t you being cheap—it’s being smart. Long-term historical analysis can come from metrics alone.

4. Automate alerting based on anomalies, not fixed thresholds. A static “alert when CPU > 80%” rule is useless if your system normally runs at 75% CPU. Use metrics to establish normal behavior, then alert when things deviate from that baseline.

5. Make logs searchable by correlating with traces. Every log entry should include the trace_id. This single change makes debugging distributed systems 10x easier.

6. Don’t alert on everything. The biggest mistake teams make is creating too many alerts. You’ll get paged constantly for things that don’t matter, and you’ll start ignoring pages. Alert on user-facing issues only; use dashboards for everything else.

Building Your First Dashboard

Let’s create a practical dashboard that shows the three pillars working together:

{
  "dashboard": {
    "title": "Service Health Overview",
    "panels": [
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Recent Errors",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{level=\"error\"} | json | stats count() by error_code"
          }
        ]
      },
      {
        "title": "Slow Traces",
        "datasource": "Jaeger",
        "description": "Traces taking more than 1 second"
      }
    ]
  }
}

Common Pitfalls and How to Avoid Them

Pitfall 1: Storing everything. You don’t need to keep all metrics forever or all traces indefinitely. The moment you try to store everything, costs explode and query performance tanks. Solution: Use cardinality limits. Set Prometheus to reject metric series with excessive unique label combinations. This forces good labeling practices.

Pitfall 2: Not instrumenting error paths. It’s easy to instrument happy paths but forget error handling. When something actually breaks, you’ll have no data. Solution: Instrument exceptions explicitly. Log them with full context. Create spans that wrap error handling.

Pitfall 3: Logs that say nothing. “Error occurred” is useless. “Error occurred during payment processing for user_id=usr_123 with amount=99.99, gateway returned status=402, retry_attempt=1” is actionable. Solution: Use structured logging everywhere. If you’re formatting log strings with string interpolation, you’re doing it wrong (see the quick contrast below).

Pitfall 4: Ignoring the business layer. Technical metrics are great, but what about business metrics? How many orders completed? What’s the conversion rate? This is observability too. Solution: Expose business metrics alongside technical ones. Create custom metrics for your domain.
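
Here is the quick Python contrast promised in Pitfall 3; the logger setup mirrors the earlier structured-logging example, and the field values are illustrative:

import logging
from pythonjsonlogger import jsonlogger

handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)

user_id, amount = "usr_123", 99.99

# Unsearchable: all the context is baked into one opaque string
logger.error("Error processing payment for %s amount %s" % (user_id, amount))

# Searchable: each field stays separate, so Loki can filter and aggregate on it
logger.error("payment_failed", extra={"user_id": user_id, "amount": amount, "gateway_status": 402})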

The Future: Observability 2.0

The industry is moving toward “Observability 2.0”—unified storage where you can click on a log, turn it into a trace, visualize it over time, and derive metrics from it. Instead of separate silos for metrics, logs, and traces, everything flows into a unified warehouse. Projects like ClickHouse-based observability platforms are making this accessible to budget-conscious teams. But even with traditional tools, the principle is the same: your three pillars should speak to each other.

Final Thoughts

Observability isn’t a feature you add; it’s a practice you build into your culture. It’s the difference between flying blind and having full instrument readings during a crisis. The good news? You don’t need a massive budget to implement it. You need understanding, discipline, and a little elbow grease. Start with metrics, graduate to logs, then embrace traces. Use open-source tools. Run them yourself. Learn from the experience. The day your production system goes down, you’ll be grateful you invested in observability. And your CEO will thank you for solving it in 4 minutes instead of 4 hours. Now go forth and observe.