linux • aws
Linux Observability Basics for Production
A focused starter stack for Linux telemetry and incident response, built for reliability under pressure.
6 min read
Linux observability is often treated as a tooling problem. Teams collect logs, install agents, and then hope visibility appears. In production, visibility comes from design choices: what signals you collect, how quickly you can interpret them, and whether your runbooks map directly to failure modes. Tool choice matters, but operating model matters more.
This post gives you a practical baseline that works in smaller teams and scales into larger environments. The objective is incident clarity: fast detection, useful context, and safe response actions.
What “Good Enough” Observability Looks Like
A useful Linux baseline answers four questions quickly:
- Is the node healthy?
- Is the service healthy?
- Is workload demand changing?
- Did something recently change?
If your current stack cannot answer these in 2-3 minutes during an incident, simplify first.
Core Signals You Need
At minimum, capture these metrics per node:
- CPU utilization and load average
- Memory available and reclaim pressure
- Disk I/O utilization and queue depth
- Network throughput and drops
- Process count and top CPU/memory processes
- File descriptor usage
- OOM kill events
Example commands that still matter:
```bash
vmstat 1
iostat -xz 1
ss -s
dmesg --ctime | tail -n 50
```
These commands are not your observability platform, but they are excellent truth checks when dashboards lag.
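When dashboards lag or disagree, it also helps to capture a one-shot snapshot you can paste into the incident channel. A minimal sketch, assuming the utilities above (procps, sysstat, iproute2) are installed; the output path and sample counts are illustrative:

```bash
#!/usr/bin/env bash
# One-shot node health snapshot for incident context.
set -euo pipefail

out="/tmp/node-snapshot-$(hostname)-$(date -u +%Y%m%dT%H%M%SZ).txt"

{
  echo "=== uptime / load ===";          uptime
  echo "=== memory (MB) ===";            free -m
  echo "=== vmstat (5 x 1s) ===";        vmstat 1 5
  echo "=== disk I/O (5 x 1s) ===";      iostat -xz 1 5
  echo "=== socket summary ===";         ss -s
  echo "=== recent kernel messages ==="; dmesg --ctime | tail -n 50
} > "$out" 2>&1

echo "snapshot written to $out"
```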
Logging Strategy: Structured First
Unstructured logs slow down incident response. If everything is plain text, parsing during outages becomes fragile. Prefer structured fields (JSON or key-value). At minimum, include:
- Timestamp in UTC
- Service name
- Environment
- Request ID or trace ID
- Error class/code
- Host and region
Sample JSON log line:
```json
{
  "ts": "2026-01-24T18:33:19Z",
  "service": "checkout-api",
  "env": "prod",
  "request_id": "0a9f-11",
  "error": "DB_TIMEOUT",
  "host": "ip-10-2-44-8"
}
```
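Structured fields pay off the moment you have to slice logs under pressure. A hedged example, assuming the service writes one JSON object per line to a local file; the path, field names, and error code mirror the sample above and are illustrative:

```bash
# Last 20 DB_TIMEOUT errors with timestamp, host, and request ID.
jq -r 'select(.error == "DB_TIMEOUT") | [.ts, .host, .request_id] | @tsv' \
  /var/log/checkout-api/app.log | tail -n 20

# Error count per host, to spot a single bad node quickly.
jq -r 'select(.error != null) | .host' /var/log/checkout-api/app.log \
  | sort | uniq -c | sort -rn | head
```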
Alerting Design: Fewer, Better Alerts
Noisy alerting burns response capacity. Page only when human action is required. Route informational alerts to async channels.
Recommended page-level alerts:
- Sustained error budget burn
- High tail latency (p95/p99) with customer impact
- Critical queue backlog growth
- Node-level failures reducing redundancy
Recommended non-page alerts:
- Minor capacity drift
- Single-node warning with no SLO impact
- Short transient spikes
The right alert is one you can act on immediately.
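If Prometheus (or anything API-compatible) is your metrics backend, you can sanity-check a paging condition by hand before wiring it into the alert pipeline. A sketch with a placeholder endpoint and hypothetical metric names:

```bash
# Error ratio over the last 10 minutes, per service.
# Endpoint, metric, and label names are placeholders; substitute your own.
curl -s 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=
    sum by (service) (rate(http_requests_total{status=~"5.."}[10m]))
      /
    sum by (service) (rate(http_requests_total[10m]))' \
  | jq '.data.result[] | {service: .metric.service, error_ratio: .value[1]}'
```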
A Practical Incident Triage Flow
Use this sequence every time:
- Confirm customer impact and blast radius
- Check recent deploy/config change timeline
- Validate node health (CPU, memory, disk, network)
- Inspect service error and latency patterns
- Compare against baseline and previous incident patterns
- Apply minimal-risk mitigation first
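Steps two and three translate directly into a quick shell pass. A minimal sketch, assuming systemd/journald and a unit named checkout-api; package-history commands differ by distro:

```bash
# What changed recently? Service warnings and package history.
journalctl -u checkout-api -p warning --since "1 hour ago" | tail -n 30
command -v rpm >/dev/null && rpm -qa --last | head -n 10
[ -f /var/log/dpkg.log ] && grep "status installed" /var/log/dpkg.log | tail -n 10

# Is the node itself under pressure?
uptime
cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io
df -h / /var
```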
Task checklist example:
- Impact confirmed
- On-call owner assigned
- Rollback/feature-flag option evaluated
- Customer communication started
- Root cause follow-up scheduled
Baseline Dashboards That Actually Help
Create four dashboards and stop there initially:
| Dashboard | Audience | Key Widgets |
|---|---|---|
| Service health | On-call | error rate, p95 latency, RPS |
| Node health | Platform | CPU, memory, disk I/O, network |
| Dependency health | App teams | DB, cache, queue saturation |
| Incident context | Everyone | deploy markers, alert timeline |
Too many dashboards fragment attention. Keep one default incident view.
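Deploy markers are the cheapest incident-context widget on that list. If Grafana is the dashboard layer, the deploy pipeline can post an annotation; a sketch assuming the standard annotations API, with a placeholder URL, token variable, and tag names:

```bash
# Post a deploy marker as a Grafana annotation (timestamp defaults to "now").
curl -s -X POST "https://grafana.internal/api/annotations" \
  -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "tags": ["deploy", "checkout-api", "prod"],
        "text": "checkout-api v2026.01.24 deployed"
      }'
```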
Runbooks: Keep Them Executable
Runbooks should contain exact commands and expected output patterns, not general advice. A useful runbook section often includes:
- Trigger condition
- Verification steps
- Mitigation options and risk level
- Rollback path
- Escalation contact
Example snippet:
```bash
# Verify system pressure
cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

# Check top offenders
ps aux --sort=-%mem | head -n 15
```
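Since expected output patterns matter as much as the commands, record what healthy looks like. On a quiet node the pressure files read roughly as below; values are illustrative, and avg10/avg60/avg300 are the percentage of time tasks were stalled over 10/60/300-second windows:

```bash
cat /proc/pressure/memory
# some avg10=0.00 avg60=0.00 avg300=0.00 total=1353500
# full avg10=0.00 avg60=0.00 avg300=0.00 total=183921
# "some" = at least one task stalled on memory; "full" = all non-idle tasks
# stalled at once. Sustained non-zero "full" averages are the serious signal.
```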
Capacity and Cost Together
Observability and FinOps should not be separated. If you only track performance, teams overprovision. If you only track cost, teams underprovision and risk reliability. Pair these metrics:
- CPU utilization + cost per node hour
- Memory headroom + cache hit ratio
- Request latency + cost per 1k requests
This creates better optimization decisions, especially during scaling events.
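The pairing can start as simple arithmetic: divide hourly spend by the work delivered in that hour. A sketch with made-up numbers; in practice both inputs come from your billing export and request metrics:

```bash
# Cost per 1k requests = hourly node cost / (requests per hour / 1000).
node_cost_per_hour=0.192   # illustrative rate, not a real quote
requests_per_hour=450000

awk -v cost="$node_cost_per_hour" -v reqs="$requests_per_hour" \
  'BEGIN { printf "cost per 1k requests: $%.4f\n", cost / (reqs / 1000) }'
```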
Security Signal Integration
Operations and security signals overlap. Include these event classes in your observability baseline:
- Sudden auth failures
- Unexpected privileged process launches
- Kernel warnings tied to exploit attempts
- Egress traffic pattern anomalies
Do not build separate worlds for observability and security if both teams touch incident response.
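Several of these event classes are already visible from the journal. A minimal sketch, assuming journald and OpenSSH; unit names and log formats vary by distro:

```bash
# Sudden auth failures: failed SSH logins in the last hour.
journalctl -u ssh -u sshd --since "1 hour ago" | grep -c "Failed password"

# Unexpected privileged activity: recent sudo invocations.
journalctl _COMM=sudo --since "1 hour ago" | tail -n 20

# Kernel warnings worth a second look during an incident.
journalctl -k -p warning --since "1 hour ago" | tail -n 20
```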
Markdown Examples for Blog Authoring
This post includes multiple Markdown features you can reuse:
Lists and Emphasis
- **Bold** for key ideas
- *Italics* for nuance
- `Inline code` for commands or field names
Blockquote
> If you cannot explain an alert in one sentence, it is probably not ready for paging.
Table (service SLO example)
| Service | SLO | Alert Trigger |
|---|---|---|
| API | 99.9% availability | 2% error rate for 10m |
| Queue worker | backlog < 5m | backlog > 20m for 15m |
| Auth | p95 < 250ms | p95 > 600ms for 10m |
Collapsible details block
Example post-incident summary template
- Incident ID
- Impact window
- Root cause
- Mitigation
- Prevention action items
45-Day Observability Rollout Plan
Days 1-10
- Define incident questions and core signals
- Normalize log schema for 2 critical services
- Build service + node dashboards
Days 11-25
- Reduce alert noise by severity routing
- Write executable runbooks for top 3 incident classes
- Add deploy markers to dashboards
Days 26-45
- Add dependency dashboards
- Link capacity signals with unit cost metrics
- Run game-day exercise and improve runbooks
Final Takeaway
Linux observability basics are not about chasing every metric. They are about establishing a coherent operating loop: detect quickly, diagnose with context, mitigate safely, and learn systematically. If you can do that with a small set of reliable signals, your platform is in a much stronger place than teams with large but noisy telemetry stacks.
Start simple, iterate with real incident data, and keep your runbooks executable. That is how observability becomes an operational advantage instead of a dashboard hobby.