Incident Playbook

Purpose

Structured response for production incidents. Whether it's a 5xx spike, a stuck RAG service, or a Langfuse dashboard that just went dark, this page gives you a consistent flow so you're not improvising at 2 AM.

Prerequisites

  • SSH + sudo access to the production host
  • Access to Phoenix + Langfuse + admin dashboard
  • The contact list (team lead, on-call engineer)

Incident phases

    ┌──────────┐
    │ Detect   │ ← alert fires or user report
    └────┬─────┘
         ↓
    ┌──────────┐
    │ Triage   │ ← scope, severity, impact
    └────┬─────┘
         ↓
    ┌──────────┐
    │ Contain  │ ← stop the bleeding
    └────┬─────┘
         ↓
    ┌──────────┐
    │ Fix      │ ← root cause + targeted change
    └────┬─────┘
         ↓
    ┌──────────┐
    │ Verify   │ ← smoke, monitor, wait
    └────┬─────┘
         ↓
    ┌──────────┐
    │Postmortem│ ← document, action items
    └──────────┘

1. Detect

What counts as an incident?

  • P0 (drop everything): WhatsApp bot not responding, admin dashboard down, data loss, security compromise
  • P1 (respond in <15 min): Elevated error rate in API, RAG unresponsive, partial admin feature broken, realtime events not flowing
  • P2 (respond within hours): Single feature bug, slow queries, Langfuse dashboard blank
  • P3 (next business day): Non-blocking warnings, cosmetic issues

Detection sources:

  • User reports (counselor Slack, marketing manager)
  • Phoenix trace anomaly
  • Langfuse cost spike
  • docker stats showing runaway container
  • Health endpoint returning non-200
  • journalctl -u huph-admin errors
  • docker-compose logs huph-api error lines
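
The health-endpoint check above can be scripted into the on-call probe. A minimal sketch — the status classification is the real logic; the URL in the comment is an assumption about this deployment:

```shell
# Classify an HTTP status code the way a health probe would:
# anything outside 2xx is an incident candidate worth triaging.
check_health() {
  if [ "$1" -ge 200 ] && [ "$1" -lt 300 ]; then
    echo "ok"
  else
    echo "incident-candidate"
  fi
}

# Example probe (curl prints only the status code; adjust host/path):
#   check_health "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health)"
check_health 200   # prints "ok"
check_health 502   # prints "incident-candidate"
```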

2. Triage

Answer these 4 questions in the first 2 minutes:

  1. What is happening? (one-sentence symptom)
  2. Who is affected? (counselors, users, both, partial)
  3. How severe? (P0–P3)
  4. What is the blast radius? (which component, which users, which data)

Create a triage note in the incident channel:

"P1 incident: admin realtime stuck Offline for all counselors, started ~14:32 WIB, investigating. Not a data issue."

3. Contain

Stop the bleeding before fixing the root cause. Containment options:

  • Rollback the last deploy — see rollback.en.md. This is the most common and cleanest containment.
  • Toggle a feature flag off — e.g. LEAD_CAPTURE_ENABLED=false, ESCALATION_ROUTING_ENABLED=false. Env flip + container restart (~30 s).
  • Restart the misbehaving container with docker-compose restart <service>. Fast, but masks the underlying issue.
  • Scale down the offending service if CPU/memory is the issue and a restart doesn't help.
  • Disable nginx route for the broken endpoint while you fix it.
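
The feature-flag flip is worth scripting so it is repeatable at 2 AM. A sketch, assuming flags live in a dotenv-style file; the service name in the usage comment comes from docker-compose.yml:

```shell
# Force a feature flag to false in a dotenv-style file; append it if absent.
flag_off() {
  flag="$1"; envfile="$2"
  if grep -q "^${flag}=" "$envfile" 2>/dev/null; then
    sed -i "s/^${flag}=.*/${flag}=false/" "$envfile"
  else
    echo "${flag}=false" >> "$envfile"
  fi
}

# Usage during containment (~30 s total):
#   flag_off LEAD_CAPTURE_ENABLED .env
#   docker-compose up -d --no-deps huph-api   # restart so the container re-reads env
```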

Security-first rule (from feedback_security_before_fix memory): when containment involves a resource anomaly (high CPU, high memory, weird traffic), verify the system is not compromised BEFORE restarting or TRUNCATE-ing. Check:

  • docker stats for unexpected containers
  • sudo last / sudo lastb for auth attempts
  • Recent commits in the branch you're deploying from
  • sudo iptables -L -n -v for traffic anomalies
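
Those checks can be captured in one pass so the evidence survives whatever containment does next. A sketch — the output path is arbitrary, and each command is skipped if unavailable:

```shell
# Snapshot forensic evidence to a file BEFORE restarting or truncating anything.
snapshot_evidence() {
  out="$1"
  {
    echo "== pre-containment snapshot $(date -u +%FT%TZ) =="
    command -v docker >/dev/null && docker stats --no-stream
    command -v last   >/dev/null && last -n 20
    command -v sudo   >/dev/null && sudo -n iptables -L -n -v
    echo "== end snapshot =="
  } >> "$out" 2>&1
}

# snapshot_evidence "/var/tmp/incident-$(date +%s).log"
```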

NEVER execute destructive actions without explicit go-ahead.

4. Fix

Root cause analysis: not "a restart fixed it", but "why did it break in the first place".

RCA checklist

  • [ ] What was the most recent change (commit, deploy, env flip)?
  • [ ] Is there a correlation between the change time and the incident start time?
  • [ ] Is it reproducible locally?
  • [ ] What does the log say at the exact moment of failure?
  • [ ] Is there a metric anomaly in Phoenix / Langfuse?
  • [ ] Did a dependency fail (360dialog? Anthropic? Dify?)
  • [ ] Did a migration run that broke a precondition?
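
The first two checklist items usually reduce to one git query. A sketch — the timestamps in the usage comment are examples, not real incident times:

```shell
# List commits that landed in a time window ending at the incident start.
recent_changes() {
  git log --oneline --since="$1" --until="$2"
}

# Example: everything committed in the hour before a 14:32 incident:
#   recent_changes "2026-04-08 13:32" "2026-04-08 14:32"
```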

Targeted fix

Once you have the root cause, make the smallest possible change to fix it. Do not bundle cleanup, refactors, or tangential fixes. Each extra change is risk.

Commit the fix with a clear message referencing the incident:

fix(api): add NEXTAUTH_SECRET to docker-compose env passthrough

Incident 2026-04-08 14:32 WIB: socket clients stuck Offline because
the API container was missing NEXTAUTH_SECRET — it was in .env but
not passed through to the container. JWE decode silently failed,
auth middleware rejected every upgrade request.

Fix: add `NEXTAUTH_SECRET=${NEXTAUTH_SECRET}` to the huph-api
environment block in docker-compose.yml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

5. Verify

  • Smoke test the affected surface (see the relevant smoke runbook)
  • Watch logs for 5-15 minutes after the fix
  • Check Phoenix traces for the original failure mode
  • Get a counselor to re-run the workflow that triggered the report

Do not mark the incident resolved until verification is green.
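
Watching for the original failure mode can be made mechanical instead of eyeballing logs. A sketch — the signature string, service name, and window in the usage comment are examples:

```shell
# Scan a log stream for the original failure signature; nonzero exit on a hit.
scan_for_regression() {
  hits=$(grep -c "$1" || true)
  if [ "$hits" -gt 0 ]; then
    echo "regression: $hits matching line(s)"
    return 1
  fi
  echo "clean"
}

# Usage during the watch window:
#   docker-compose logs --since 15m huph-api | scan_for_regression "JWE decode"
```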

6. Postmortem

Write a postmortem within 48 hours. Even for P2 issues — the habit matters more than the specific document.

Postmortem template

# Incident 2026-MM-DD — <short summary>

## Timeline (WIB)

- HH:MM — incident started
- HH:MM — detected via <source>
- HH:MM — triaged as P<N>
- HH:MM — containment action: <...>
- HH:MM — root cause identified: <...>
- HH:MM — fix deployed: <commit SHA>
- HH:MM — verified resolved

## Impact

- Duration: X minutes
- Affected users: <counselors / prospects / both>
- Data impact: <none | lost / duplicated / corrupted>
- Business impact: <none | X leads not captured during window>

## Root cause

One paragraph, technically precise.

## Five whys

1. Why did the service fail?
2. Why was that condition present?
3. Why was that condition not caught earlier?
4. Why does the system not prevent this?
5. Why is this class of issue not monitored?

## What went well

- <fast detection, clean rollback, etc>

## What went poorly

- <slow detection, manual-only remediation, etc>

## Action items

| # | Action | Owner | Due |
|---|---|---|---|
| 1 | Add alerting for <X> | @engineer | 2026-MM-DD |
| 2 | Document runbook for <Y> | @engineer | 2026-MM-DD |

## Blameless tone

This is not about finger-pointing. It's about what the system let
slip through. Assume everyone acted with good intent and the
information they had at the time.

Incident communication

  • Internal channel first — alert the team immediately
  • User-facing status page — (not currently implemented for HUPH; add it as an action item when user-facing outages start happening)
  • Post-resolution update in the team channel with 1-line summary + postmortem link

Gotchas

  1. Don't fix and forget. Even a 5-minute P2 needs a postmortem doc, otherwise the same class of bug comes back.
  2. Don't skip triage. Jumping straight to fix without understanding scope often makes it worse.
  3. Containment ≠ fix. A restart is containment, not a fix. The root cause still needs resolving.
  4. Rolling back is always an option. If you're not making progress in 15 minutes, roll back and investigate calmly.
  5. Security-first rule applies to containment, not just fixes. Don't destroy evidence (TRUNCATE, force-restart, config-wipe) without confirming you're not destroying an attacker's footprint.

See also