Jason Feil

linux • aws

Linux Observability Basics for Production

A focused starter stack for Linux telemetry and incident response, built for reliability under pressure.


Linux observability is often treated as a tooling problem. Teams collect logs, install agents, and then hope visibility appears. In production, visibility comes from design choices: what signals you collect, how quickly you can interpret them, and whether your runbooks map directly to failure modes. Tool choice matters, but operating model matters more.

This post gives you a practical baseline that works in smaller teams and scales into larger environments. The objective is incident clarity: fast detection, useful context, and safe response actions.

What “Good Enough” Observability Looks Like

A useful Linux baseline answers four questions quickly:

  1. Is the node healthy?
  2. Is the service healthy?
  3. Is workload demand changing?
  4. Did something recently change?

If your current stack cannot answer these in 2-3 minutes during an incident, simplify first.

Core Signals You Need

At minimum, capture these metrics per node:

  • CPU utilization and load average
  • Memory available and reclaim pressure
  • Disk I/O utilization and queue depth
  • Network throughput and drops
  • Process count and top CPU/memory processes
  • File descriptor usage
  • OOM kill events

Example commands that still matter:

vmstat 1                      # run queue, memory, swap, and CPU, sampled every second
iostat -xz 1                  # per-device utilization, await, and queue depth
ss -s                         # socket summary: established, time-wait, orphaned
dmesg --ctime | tail -n 50    # recent kernel messages with readable timestamps

These commands are not your observability platform, but they are excellent truth checks when dashboards lag.
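
Two of the signals above, file descriptor usage and OOM kill events, are not covered by the commands listed, so a couple of quick checks help (a minimal sketch; exact kernel log wording varies by kernel version, and <pid> is a placeholder):

# File descriptors: allocated vs. system-wide maximum
cat /proc/sys/fs/file-nr

# Per-process usage and limit for a specific PID
ls /proc/<pid>/fd | wc -l
grep "open files" /proc/<pid>/limits

# Recent OOM kills recorded by the kernel
dmesg --ctime | grep -i "out of memory"
journalctl -k | grep -i "killed process"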

Logging Strategy: Structured First

Unstructured logs slow down incident response. If everything is plain text, parsing during outages becomes fragile. Prefer structured fields (JSON or key-value). At minimum, include:

  • Timestamp in UTC
  • Service name
  • Environment
  • Request ID or trace ID
  • Error class/code
  • Host and region

Sample JSON log line:

{
  "ts": "2026-01-24T18:33:19Z",
  "service": "checkout-api",
  "env": "prod",
  "request_id": "0a9f-11",
  "error": "DB_TIMEOUT",
  "host": "ip-10-2-44-8"
}
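
With fields like these, incident queries become one-liners long before logs reach an aggregator. A minimal sketch, assuming the service writes one JSON object per line to /var/log/checkout-api/app.json (a hypothetical path) and jq is available:

# Count DB_TIMEOUT errors in the current file
jq -c 'select(.error == "DB_TIMEOUT")' /var/log/checkout-api/app.json | wc -l

# List timestamp, host, and request ID for each occurrence
jq -r 'select(.error == "DB_TIMEOUT") | [.ts, .host, .request_id] | @tsv' /var/log/checkout-api/app.json

The schema, not the tool, is what makes these queries cheap; the same fields translate directly into whatever aggregator you use.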

Alerting Design: Fewer, Better Alerts

Noisy alerting burns response capacity. Page only when human action is required. Route informational alerts to async channels.

Recommended page-level alerts:

  • Sustained error budget burn
  • High tail latency (p95/p99) with customer impact
  • Critical queue backlog growth
  • Node-level failures reducing redundancy

Recommended non-page alerts:

  • Minor capacity drift
  • Single-node warning with no SLO impact
  • Short transient spikes

The right alert is one you can act on immediately.
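
For the first page-level alert, the arithmetic is worth keeping handy: a 99.9% availability SLO over 30 days allows roughly 43 minutes of error budget, and burn rate is the observed error rate divided by the allowed rate. A minimal sketch (the 2% observed error rate is a hypothetical input):

# Burn rate = observed error rate / allowed error rate (1 - SLO)
awk 'BEGIN {
  slo = 0.999;        # availability target
  observed = 0.02;    # hypothetical error rate over the alert window
  printf "burn rate: %.0fx (30-day budget: %.1f min)\n", observed / (1 - slo), 30 * 24 * 60 * (1 - slo)
}'

A burn rate in the low tens sustained for an hour is a common fast-burn paging threshold; small, short-lived burns belong with the non-page alerts above.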

A Practical Incident Triage Flow

Use this sequence every time:

  1. Confirm customer impact and blast radius
  2. Check recent deploy/config change timeline
  3. Validate node health (CPU, memory, disk, network)
  4. Inspect service error and latency patterns
  5. Compare against baseline and previous incident patterns
  6. Apply minimal-risk mitigation first
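
For steps 2 and 3, a few commands answer most of the timeline and node-health questions before you open a single dashboard (a minimal sketch; the checkout-api unit name is just an example):

# Step 2: recent restarts and service-level events
last reboot | head -n 3
journalctl -u checkout-api --since "1 hour ago" --no-pager | tail -n 50

# Step 3: node health at a glance
uptime        # load average vs. CPU count
free -m       # memory and swap headroom
df -h         # filesystem capacity
ip -s link    # interface errors and drops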

Task checklist example:

  • Impact confirmed
  • On-call owner assigned
  • Rollback/feature-flag option evaluated
  • Customer communication started
  • Root cause follow-up scheduled

Baseline Dashboards That Actually Help

Create four dashboards and stop there initially:

Dashboard | Audience | Key Widgets
--- | --- | ---
Service health | On-call | error rate, p95 latency, RPS
Node health | Platform | CPU, memory, disk I/O, network
Dependency health | App teams | DB, cache, queue saturation
Incident context | Everyone | deploy markers, alert timeline

Too many dashboards fragment attention. Keep one default incident view.

Runbooks: Keep Them Executable

Runbooks should contain exact commands and expected output patterns, not general advice. A useful runbook section often includes:

  • Trigger condition
  • Verification steps
  • Mitigation options and risk level
  • Rollback path
  • Escalation contact

Example snippet:

# Verify system pressure
cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io
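# Healthy nodes usually show avg10 and avg60 near 0;
# sustained double-digit avg10 values indicate real stall time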

# Check top offenders
ps aux --sort=-%mem | head -n 15

Capacity and Cost Together

Observability and FinOps should not be separated. If you only track performance, teams overprovision. If you only track cost, teams underprovision and risk reliability. Pair these metrics:

  • CPU utilization + cost per node hour
  • Memory headroom + cache hit ratio
  • Request latency + cost per 1k requests

This creates better optimization decisions, especially during scaling events.
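
The last pairing reduces to simple unit math that is worth scripting so it stays visible. A minimal sketch with hypothetical inputs (an instance billed at $0.096 per hour serving a sustained 120 requests per second):

# Cost per 1k requests = hourly node cost / (requests per hour / 1000)
awk 'BEGIN {
  node_cost_per_hour = 0.096;   # hypothetical instance price
  rps = 120;                    # hypothetical sustained request rate
  printf "cost per 1k requests: $%.5f\n", node_cost_per_hour / (rps * 3600 / 1000)
}'

Tracking that number next to p95 latency shows whether a scaling change improved efficiency or just shifted spend.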

Security Signal Integration

Operations and security signals overlap. Include these event classes in your observability baseline:

  • Sudden auth failures
  • Unexpected privileged process launches
  • Kernel warnings tied to exploit attempts
  • Egress traffic pattern anomalies

Do not build separate worlds for observability and security if both teams touch incident response.
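
For the first two event classes, host-level checks exist before any SIEM query (a minimal sketch; the SSH unit may be named ssh or sshd depending on distribution, and lastb needs root):

# SSH auth failures in the last 15 minutes
journalctl -u sshd --since "15 min ago" --no-pager | grep -c "Failed password"

# Failed login attempts recorded in btmp
lastb | head -n 20

# Recently started processes running as root
ps -U root -u root -o pid,lstart,cmd --sort=-start_time | head -n 15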

Markdown Examples for Blog Authoring

This post includes multiple Markdown features you can reuse:

Lists and Emphasis

  • **Bold** for key ideas
  • *Italics* for nuance
  • `Inline code` for commands or field names

Blockquote

> If you cannot explain an alert in one sentence, it is probably not ready for paging.

Table (service SLO example)

Service | SLO | Alert Trigger
--- | --- | ---
API | 99.9% availability | 2% error rate for 10m
Queue worker | backlog < 5m | backlog > 20m for 15m
Auth | p95 < 250ms | p95 > 600ms for 10m

Collapsible details block

<details>
<summary>Example post-incident summary template</summary>

  • Incident ID
  • Impact window
  • Root cause
  • Mitigation
  • Prevention action items

</details>

45-Day Observability Rollout Plan

Days 1-10

  • Define incident questions and core signals
  • Normalize log schema for 2 critical services
  • Build service + node dashboards

Days 11-25

  • Reduce alert noise by severity routing
  • Write executable runbooks for top 3 incident classes
  • Add deploy markers to dashboards

Days 26-45

  • Add dependency dashboards
  • Link capacity signals with unit cost metrics
  • Run game-day exercise and improve runbooks

Final Takeaway

Linux observability basics are not about chasing every metric. They are about establishing a coherent operating loop: detect quickly, diagnose with context, mitigate safely, and learn systematically. If you can do that with a small set of reliable signals, your platform is in a much stronger place than teams with large but noisy telemetry stacks.

Start simple, iterate with real incident data, and keep your runbooks executable. That is how observability becomes an operational advantage instead of a dashboard hobby.
