
ClickHouse OOM Recovery (Langfuse)

Purpose

Recovery runbook for the ClickHouse container that backs Langfuse. Langfuse uses ClickHouse to store traces, spans, and scores. A persistent OOM loop hit production on 2026-04-08 and was fixed via config tuning. This runbook captures both the fix and the recovery procedure if the issue recurs.

Prerequisites

  • SSH + sudo access to the production host
  • Access to Langfuse admin (credentials in CREDENTIALS.md)
  • Familiarity with docker stats and docker-compose logs

Symptom

  • https://langfuse.huph.val.id dashboard is empty, blank, or times out on load
  • docker stats shows ClickHouse container at 500%+ CPU or repeatedly OOM-killed
  • Langfuse worker/web containers log connection refused or ClickHouse query timeout
  • API container logs show langfuse: batch flush failed

History

2026-04-08 — initial occurrence and fix:

  • Symptom: ClickHouse self-OOM loop. Container CPU jumped from baseline ~10% to 580% sustained.
  • Root cause: ClickHouse's internal logging pipeline generated huge volumes of log messages and then queried its own system.asynchronous_metric_log table. That table kept growing, which triggered more logging, producing a runaway loop. No user queries were hitting ClickHouse during the loop.
  • Fix: docker/clickhouse/config.d/logs.xml with:
      • A 2 GB memory cap for the server process
      • Noisy internal logs disabled (metric_log, query_thread_log, opentelemetry_span_log, asynchronous_metric_log)
      • A 7-day TTL on the internal logs that are kept (query_log, trace_log, text_log) to cap disk growth
  • Memory limit also bumped in docker-compose.langfuse.yml: ClickHouse 2 GB → 3 GB to give headroom above the 2 GB internal cap.

Recovery procedure (if it recurs)

Step 1 — Confirm it is ClickHouse

docker stats --no-stream | grep clickhouse

Expected under normal load: low CPU (<20%) and memory well under the 3 GB limit. If CPU stays above 100% or memory sits near 3 GB, proceed.
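The thresholds above can be turned into a quick pass/fail check. The sketch below is only an illustration: the function name check_clickhouse_stats is hypothetical, the 100% CPU threshold comes from the text, and the 90% memory threshold is an assumed stand-in for "near 3 GB". It parses lines of the form "name cpu% mem%" from stdin, which docker stats can produce via its --format option.

```shell
# Reads "name cpu% mem%" lines on stdin and prints a verdict for each
# ClickHouse container. cpu_max comes from the runbook text; mem_max
# (percent of the container limit) is an assumption.
check_clickhouse_stats() {
  awk -v cpu_max=100 -v mem_max=90 '
    /clickhouse/ {
      cpu = $2; mem = $3
      gsub(/%/, "", cpu); gsub(/%/, "", mem)
      status = (cpu + 0 > cpu_max || mem + 0 > mem_max) ? "ALERT" : "OK"
      printf "%s %s cpu=%s%% mem=%s%%\n", status, $1, cpu, mem
    }'
}

# Usage on the host:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' | check_clickhouse_stats
```

An ALERT line means the container matches the Step 1 criteria and the rest of the procedure applies.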

Step 2 — Check ClickHouse logs

docker logs --tail 50 huph-clickhouse 2>&1

Look for:

  • DB::Exception lines
  • Memory limit exceeded messages
  • OOM killer messages (oom-kill); these are written by the host kernel, so also check sudo dmesg on the host
  • Self-query loops (the same system.* queries repeating)
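To get a quick count of the patterns listed above instead of reading the log by eye, something like the following could be used. It is a sketch: the function name scan_clickhouse_log is hypothetical, and the regular expression for the memory message assumes the text contains "Memory limit" followed later by "exceeded".

```shell
# Counts how many log lines (read from stdin) match each Step 2 pattern.
scan_clickhouse_log() {
  local log pattern
  log=$(cat)
  for pattern in 'DB::Exception' 'Memory limit.*exceeded' 'oom-kill'; do
    printf '%s: %s\n' "$pattern" \
      "$(printf '%s\n' "$log" | grep -c -E -- "$pattern")"
  done
}

# Usage on the host (2>&1 so container stderr reaches the pipe):
#   docker logs --tail 200 huph-clickhouse 2>&1 | scan_clickhouse_log
# Docker also records whether the kernel OOM-killed the container:
#   docker inspect --format '{{.State.OOMKilled}}' huph-clickhouse
```

Non-zero counts for the first two patterns point at ClickHouse itself; the OOMKilled flag covers the case where the kernel killed the process before it could log anything.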

Step 3 — Verify the fix is still applied

cat /opt/huph/docker/clickhouse/config.d/logs.xml

Expected content: the 2 GB memory cap, the disabled log tables, and the TTL blocks. If the file is missing or has been reverted, restore it from git:

cd /opt/huph
git log --oneline -- docs/superpowers/ docker/clickhouse/
git checkout <sha> -- docker/clickhouse/config.d/logs.xml

Then restart ClickHouse:

docker-compose -f docker-compose.langfuse.yml restart clickhouse
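The visual check of logs.xml at the start of this step can also be scripted. The sketch below assumes the file uses the same element names as the manual fix later in this runbook; the function name verify_logs_xml is hypothetical. It does plain substring matching and does not validate the XML.

```shell
# Returns 0 if FILE contains every element the fix relies on; otherwise
# prints what is missing and returns 1. Substring match only.
verify_logs_xml() {
  local file=$1 missing=0 needle
  for needle in \
    '<max_server_memory_usage>' \
    '<metric_log remove="1"' \
    '<query_thread_log remove="1"' \
    '<opentelemetry_span_log remove="1"' \
    '<asynchronous_metric_log remove="1"' \
    '<ttl>'; do
    if ! grep -qF -- "$needle" "$file" 2>/dev/null; then
      echo "missing: $needle"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the host:
#   verify_logs_xml /opt/huph/docker/clickhouse/config.d/logs.xml && echo "fix present"
```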

Step 4 — If config is correct but OOM persists

Security check first (per feedback_security_before_fix):

  • Confirm no unauthorized access via sudo last / sudo lastb
  • Check for recent commits modifying the Langfuse stack
  • Confirm traffic is normal via nginx access logs

Only after ruling out compromise:

# Restart just ClickHouse — preserves data
docker-compose -f docker-compose.langfuse.yml restart clickhouse

# If restart doesn't recover within 2 minutes, stop + start
docker-compose -f docker-compose.langfuse.yml stop clickhouse
docker-compose -f docker-compose.langfuse.yml up -d clickhouse

# NOT `down && up` — that can wipe volume state
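The "recover within 2 minutes" decision can be made less subjective with a small polling helper. This is a sketch: wait_until is a hypothetical name, and the probe command in the usage comment assumes clickhouse-client is available inside the huph-clickhouse container.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds (returns 0) or the timeout elapses
# (returns 1). Uses bash's built-in SECONDS counter.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Usage on the host, after the restart:
#   wait_until 120 5 docker exec huph-clickhouse clickhouse-client --query 'SELECT 1' \
#     || echo "not recovered within 2 minutes: fall back to stop + up -d"
```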

Step 5 — Wait and verify

After restart, wait ~2 minutes for ClickHouse to rejoin the Langfuse stack. Then:

curl -s https://langfuse.huph.val.id/ -o /dev/null -w '%{http_code}\n'
# should be 200

docker stats --no-stream | grep clickhouse
# should show low CPU, memory well under 3 GB

Open the Langfuse dashboard in a browser, confirm traces are loading.

Step 6 — Monitor

Watch for 15 minutes after recovery:

watch -n 5 'docker stats --no-stream | grep clickhouse'

If CPU climbs again within 15 minutes, the config fix isn't holding. Escalate and consider a forward fix (e.g. also disabling query_log, or reducing retention).
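For an unattended version of this watch, CPU samples can be collected and checked for a sustained run above the threshold. The sketch below is illustrative: cpu_sustained_high is a hypothetical name, and what counts as "sustained" (how many consecutive samples) is an assumption left to the caller.

```shell
# cpu_sustained_high THRESHOLD NEEDED
# Reads one CPU percentage per line on stdin (with or without "%").
# Exits 0 if NEEDED consecutive samples exceed THRESHOLD, else 1.
cpu_sustained_high() {
  local threshold=$1 needed=$2
  awk -v t="$threshold" -v n="$needed" '
    { v = $1; gsub(/%/, "", v) }
    v + 0 > t { run++; if (run >= n) { found = 1; exit }; next }
    { run = 0 }
    END { exit found ? 0 : 1 }'
}

# Usage on the host: about 15 minutes of samples, alert on 12 in a row over 100%.
#   for i in $(seq 180); do
#     docker stats --no-stream --format '{{.CPUPerc}}' huph-clickhouse; sleep 5
#   done | cpu_sustained_high 100 12 && echo "CPU climbing again: escalate"
```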

If config is missing and can't be restored

Apply the fix manually:

  1. Create /opt/huph/docker/clickhouse/config.d/logs.xml:
<clickhouse>
  <max_server_memory_usage>2000000000</max_server_memory_usage>

  <metric_log remove="1"/>
  <query_thread_log remove="1"/>
  <opentelemetry_span_log remove="1"/>
  <asynchronous_metric_log remove="1"/>

  <query_log>
    <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
  </query_log>
  <trace_log>
    <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
  </trace_log>
  <text_log>
    <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
  </text_log>
</clickhouse>
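  2. Make sure docker-compose.langfuse.yml mounts the config directory:
volumes:
  - ./docker/clickhouse/config.d:/etc/clickhouse-server/config.d
  3. Restart ClickHouse:
docker-compose -f docker-compose.langfuse.yml restart clickhouse
     If you had to add or change the mount in step 2, run up -d instead, since restart does not apply compose file changes:
docker-compose -f docker-compose.langfuse.yml up -d clickhouse
  4. Verify the config file is visible inside the container:
docker exec huph-clickhouse ls /etc/clickhouse-server/config.d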

What NOT to do

  • Don't docker-compose down and up the Langfuse stack — it may wipe the ClickHouse data volume if volume configuration is off. Use restart or targeted stop/up.
  • Don't TRUNCATE anything without confirming the system is not compromised.
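  • Don't raise the container memory limit above 4 GB without also adjusting the internal max_server_memory_usage in the XML config. Keep the two in step, with the internal cap below the container limit: if the cap reaches or exceeds the limit, the kernel OOM-kills the container before ClickHouse can reject queries on its own.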
  • Don't disable more log tables than the 4 listed above — some are needed for Langfuse's own functionality.
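The relationship between the two memory settings can be checked with simple arithmetic before changing either one. This is a sketch under assumptions: check_headroom is a hypothetical name, both values are given in bytes, and a container limit of 0 is taken to mean "no limit", which is how Docker reports an unset memory limit.

```shell
# check_headroom CAP_BYTES LIMIT_BYTES
# CAP_BYTES:   max_server_memory_usage from logs.xml
# LIMIT_BYTES: container memory limit (0 = unlimited)
check_headroom() {
  local cap=$1 limit=$2
  if (( limit == 0 )); then
    echo "WARN: container has no memory limit"
    return 0
  fi
  if (( cap >= limit )); then
    echo "UNSAFE: internal cap $cap >= container limit $limit"
    return 1
  fi
  echo "OK: headroom $(( (limit - cap) / 1000000 )) MB"
}

# Usage on the host (the container limit is read from Docker):
#   check_headroom 2000000000 "$(docker inspect --format '{{.HostConfig.Memory}}' huph-clickhouse)"
```

With the values in this runbook (a 2000000000-byte cap under a 3 GB container limit) the check passes with a bit over 1 GB of headroom.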

Gotchas

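  1. docker restart vs docker-compose restart. Prefer the docker-compose variants, which address the service by its compose name. Neither restart variant recreates the container, so changes to env vars, mounts, or memory limits in the compose file only take effect with docker-compose up -d.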
  2. Langfuse stack has 8+ containers. Only restart the one that is broken. Restarting the whole stack is a bigger hammer than needed.
  3. Data is in a docker volume, not a bind mount. The volume persists across restart. If you ever docker-compose down -v, you lose everything — -v deletes volumes. Don't.
  4. The fix is XML, not an environment variable. ClickHouse config is XML by design. If you're expecting a CLI flag or env var, you're going to be frustrated.

See also