Runtime Zero

Kubernetes Observability with OpenTelemetry

OpenTelemetry has unified the instrumentation story across traces, metrics, and logs. This post covers deploying the OTel Operator on Kubernetes, auto-instrumentation, and routing telemetry to Prometheus and Jaeger.

DC

Before OpenTelemetry, getting traces, metrics, and logs out of Kubernetes applications meant three different SDKs, three different agents, and three different pipelines. The OTel project has largely solved this — here's how to deploy it properly.

The OpenTelemetry Operator

The OTel Operator is a Kubernetes controller that manages OpenTelemetryCollector and Instrumentation custom resources. Install it via Helm:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --create-namespace \
  --set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib"

Deploying a Collector in Gateway Mode

Run a central collector that receives telemetry from all workloads and fans it out to backends:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: gateway
  namespace: monitoring
spec:
  mode: Deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 10s
      memory_limiter:
        check_interval: 1s
        limit_mib: 512

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
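      # The dedicated jaeger exporter has been removed from recent collector
      # releases. Jaeger accepts OTLP directly, so send traces with the otlp
      # exporter to the Jaeger collector's OTLP gRPC port (4317).
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      loki:
        endpoint: http://loki:3100/loki/api/v1/push

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/jaeger]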
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
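
One step the config above leaves implicit: the prometheus exporter only opens a scrape endpoint on port 8889, so Prometheus still has to be told to scrape it. A minimal sketch of a scrape job, assuming the operator-created Service is named gateway-collector in the monitoring namespace and exposes port 8889 (worth confirming with kubectl get svc -n monitoring); the job name is made up:

```yaml
# prometheus.yml fragment (sketch, not a complete Prometheus config)
scrape_configs:
  - job_name: otel-gateway            # hypothetical job name
    scrape_interval: 30s
    static_configs:
      # Assumed Service DNS name and port; adjust to what the operator created.
      - targets: ["gateway-collector.monitoring:8889"]
```

If you run the Prometheus Operator, a ServiceMonitor pointing at the same port would do the same job.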

Auto-Instrumentation: Zero Code Changes

The operator's Instrumentation resource injects OTel SDKs into pods at admission time. For Java, Python, Node.js, and .NET, this requires no code changes:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: my-app
spec:
  exporter:
    endpoint: http://gateway-collector.monitoring:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
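
The resource above only configures Java and Node.js. Python needs a little more care, because the Python auto-instrumentation exports OTLP over HTTP/protobuf rather than gRPC. A sketch of the extra block, assuming the same gateway Service and overriding the endpoint for Python only:

```yaml
# Fragment of the Instrumentation spec; only the python block is new.
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    env:
      # Point Python at the collector's OTLP HTTP port (4318), not gRPC (4317).
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://gateway-collector.monitoring:4318
```

Python pods then opt in with the instrumentation.opentelemetry.io/inject-python annotation, in the same way as the Java annotation.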

Annotate pods to opt in:

annotations:
  instrumentation.opentelemetry.io/inject-java: "true"
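
In a Deployment, that annotation belongs on the pod template, not on the Deployment's own metadata, since the webhook acts on pods. A minimal sketch, with a made-up service name and a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service          # hypothetical name
  namespace: my-app             # same namespace as the Instrumentation resource
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orders-service
  template:
    metadata:
      labels:
        app: orders-service
      annotations:
        # Pod-level annotation: this is what the operator's webhook reads.
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: app
          image: registry.example.com/orders-service:1.0.0   # placeholder image
```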

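The operator's admission webhook mutates the pod spec: it adds an init container that copies the Java agent into a shared volume and sets JAVA_TOOL_OPTIONS so the JVM loads it at startup. Because injection happens at admission time, pods that were already running must be recreated to pick it up. Apart from the annotation, application manifests stay untouched.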

Correlating Traces and Logs

The most valuable OTel feature for troubleshooting is trace-log correlation. Ensure your logging instrumentation writes trace_id and span_id into every log record, then configure Grafana to link from a Loki log line to the corresponding Jaeger trace. The result: from an error log, one click takes you to the full distributed trace.

Set up the correlation under Derived Fields in Grafana's Loki datasource configuration: a regex extracts the trace ID from the log line, and the derived field links it to the Jaeger datasource.
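
If you provision Grafana datasources from files, the same setup can be sketched as below. It assumes log lines contain trace_id=<hex>, that the Jaeger datasource uid is jaeger, and that Jaeger's query Service is reachable at jaeger-query:16686; none of these come from the setup above, so adjust them to your environment:

```yaml
# Grafana datasource provisioning file (sketch)
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    uid: jaeger                       # assumed uid, referenced below
    url: http://jaeger-query:16686    # assumed Jaeger query Service
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          # Capture group 1 becomes the field value; match your log format.
          matcherRegex: "trace_id=(\\w+)"
          # $$ escapes the $ so provisioning does not expand it as an env var.
          url: "$${__value.raw}"
          # Internal link: open the extracted ID in the Jaeger datasource.
          datasourceUid: jaeger
```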

What to Instrument First

Don't try to instrument everything at once. Prioritise:

  1. HTTP/gRPC service boundaries: auto-instrumentation covers these out of the box.
  2. Database calls: most OTel SDKs ship instrumentation for common database clients.
  3. Message queue producers/consumers: instrumentation for Kafka and RabbitMQ clients exists for most major languages.

Leave internal library calls for later. The 20% of instrumentation effort that covers calls crossing service boundaries gives you 80% of the diagnostic value.