ClickHouse OOM Recovery (Langfuse)
Purpose
Recovery runbook for the ClickHouse container that backs Langfuse. Langfuse uses ClickHouse to store traces, spans, and scores. A persistent OOM loop hit production on 2026-04-08 and was fixed via config tuning. This runbook captures both the fix and the recovery procedure if the issue recurs.
Prerequisites
- SSH + sudo access to the production host
- Access to Langfuse admin (credentials in CREDENTIALS.md)
- Familiarity with docker stats and docker-compose logs
Symptom
- https://langfuse.huph.val.id dashboard is empty, blank, or times out on load
- docker stats shows ClickHouse container at 500%+ CPU or repeatedly OOM-killed
- Langfuse worker/web containers log connection refused or ClickHouse query timeout
- API container logs show langfuse: batch flush failed
History
2026-04-08 — initial occurrence and fix:
- Symptom: ClickHouse self-OOM loop. Container CPU jumped from baseline ~10% to 580% sustained.
- Root cause: ClickHouse's internal logging pipeline was generating huge volumes of log messages, then trying to query its own system.asynchronous_metric_log table, which kept growing, which kept triggering more logging, creating a runaway loop. No user queries were hitting it during the loop.
- Fix: docker/clickhouse/config.d/logs.xml with:
  - Memory cap 2 GB for the process
  - Disable noisy internal logs (metric_log, query_thread_log, opentelemetry_span_log, asynchronous_metric_log)
  - 7-day TTL on kept internal logs (query_log, trace_log, text_log) to cap disk growth
- Memory limit also bumped in docker-compose.langfuse.yml: ClickHouse 2 GB → 3 GB to give headroom above the 2 GB internal cap.
Recovery procedure (if it recurs)
Step 1 — Confirm it is ClickHouse
docker stats --no-stream | grep clickhouse
Expected under normal load: low CPU (<20%), memory well under the 3 GB limit. If you see sustained CPU above 100% or memory near 3 GB, proceed.
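The thresholds above can be turned into a quick pass/fail check. The following is a minimal sketch, not part of the original fix: the function name classify_stats_line is hypothetical, and the 90% memory threshold is an assumed stand-in for "memory near 3 GB" (it relies on the container limit being the 3 GB compose limit).

```shell
# classify_stats_line "<name> <cpu%> <mem%>"
# Prints "suspect" when CPU is above 100% or memory is above 90% of the
# container limit (assumed thresholds), otherwise "ok".
classify_stats_line() {
  printf '%s\n' "$1" | awk '{
    cpu = $2; mem = $3
    gsub(/%/, "", cpu); gsub(/%/, "", mem)
    if (cpu + 0 > 100 || mem + 0 > 90) print "suspect"
    else print "ok"
  }'
}

# Usage on the host (not executed here):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' \
#     | grep clickhouse \
#     | while read -r line; do classify_stats_line "$line"; done
```

A single "suspect" reading is only a hint; the runbook's criterion is sustained load, so repeat the check a few times before proceeding.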
Step 2 — Check ClickHouse logs
docker logs --tail 50 huph-clickhouse 2>&1
Look for:
- DB::Exception lines
- Memory limit exceeded messages
- OOM killer messages (oom-kill)
- Self-query loops (same system.* queries repeating)
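To avoid scanning the log by eye, the patterns from this step can be counted in one pass. This is a sketch under assumptions: count_oom_signals is a hypothetical helper, and the memory pattern is written as a regex because ClickHouse usually places a qualifier such as (total) or (for query) between "Memory limit" and "exceeded".

```shell
# count_oom_signals: read log text on stdin and print one summary line
# with the number of lines matching each pattern from Step 2.
count_oom_signals() {
  awk '
    /DB::Exception/          { exc++ }
    /Memory limit.*exceeded/ { mem++ }
    /oom-kill/               { oom++ }
    END { printf "exceptions=%d memory_limit=%d oom_kill=%d\n", exc, mem, oom }
  '
}

# Usage on the host (not executed here):
#   docker logs --tail 200 huph-clickhouse 2>&1 | count_oom_signals
```

The 2>&1 matters because docker logs replays the container's stderr on stderr, which a plain pipe would not capture.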
Step 3 — Verify the fix is still applied
cat /opt/huph/docker/clickhouse/config.d/logs.xml
Expected content: the memory cap 2 GB + logs disabled + TTL block. If this file is missing or has been reverted, restore it from git:
cd /opt/huph
git log --oneline -- docs/superpowers/ docker/clickhouse/
git checkout <sha> -- docker/clickhouse/config.d/logs.xml
Then restart ClickHouse:
docker-compose -f docker-compose.langfuse.yml restart clickhouse
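The "expected content" check in this step can be scripted instead of eyeballing the cat output. A minimal sketch, assuming the file matches the XML given in the manual-fix section of this runbook; check_logs_xml is a hypothetical helper and only looks for the presence of each element, not for valid XML.

```shell
# check_logs_xml <file>
# Verifies the three parts of the fix are present: memory cap, disabled
# logs, and a TTL block. Prints what is missing; returns 1 on any gap.
check_logs_xml() {
  file=$1
  if [ ! -f "$file" ]; then
    echo "missing: $file"
    return 1
  fi
  status=0
  for needle in \
    '<max_server_memory_usage>' \
    '<metric_log remove="1"' \
    '<query_thread_log remove="1"' \
    '<opentelemetry_span_log remove="1"' \
    '<asynchronous_metric_log remove="1"' \
    '<ttl>'
  do
    if ! grep -qF -- "$needle" "$file"; then
      echo "not found: $needle"
      status=1
    fi
  done
  return "$status"
}

# Usage on the host (not executed here):
#   check_logs_xml /opt/huph/docker/clickhouse/config.d/logs.xml
```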
Step 4 — If config is correct but OOM persists
Security check first (per feedback_security_before_fix):
- Confirm no unauthorized access via sudo last / sudo lastb
- Check for recent commits modifying the Langfuse stack
- Confirm traffic is normal via nginx access logs
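One way to sanity-check traffic is to look at which client addresses dominate the access log. The sketch below is illustrative only: top_client_ips is a hypothetical helper, it assumes the client address is the first field of each log line (as in nginx's default combined format), and the log path in the usage comment is the nginx default rather than something this runbook confirms.

```shell
# top_client_ips [n]
# Reads access-log lines on stdin and prints "<count> <ip>" for the n
# busiest client addresses (default 10), busiest first.
top_client_ips() {
  awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' \
    | sort -rn \
    | head -n "${1:-10}"
}

# Usage on the host (not executed here; log path is an assumption):
#   sudo tail -n 5000 /var/log/nginx/access.log | top_client_ips 5
```

A single address accounting for most recent requests is worth explaining before restarting anything.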
Only after ruling out compromise:
# Restart just ClickHouse — preserves data
docker-compose -f docker-compose.langfuse.yml restart clickhouse
# If restart doesn't recover within 2 minutes, stop + start
docker-compose -f docker-compose.langfuse.yml stop clickhouse
docker-compose -f docker-compose.langfuse.yml up -d clickhouse
# NOT `down && up` — that can wipe volume state
Step 5 — Wait and verify
After restart, wait ~2 minutes for ClickHouse to rejoin the Langfuse stack. Then:
curl -s https://langfuse.huph.val.id/ -o /dev/null -w '%{http_code}\n'
# should be 200
docker stats --no-stream | grep clickhouse
# should show low CPU, memory well under 3 GB
Open the Langfuse dashboard in a browser, confirm traces are loading.
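Rather than waiting a fixed two minutes, the HTTP check can be polled until it passes. A minimal sketch: wait_until and langfuse_is_up are hypothetical helper names, and the 120-second timeout and 5-second interval in the usage comment are assumptions derived from the "~2 minutes" guidance above.

```shell
# wait_until <timeout_s> <interval_s> <command...>
# Re-runs the command every <interval_s> seconds until it succeeds or
# <timeout_s> seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Succeeds when the Langfuse front page answers with HTTP 200.
langfuse_is_up() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://langfuse.huph.val.id/)
  [ "$code" = "200" ]
}

# Usage on the host (not executed here):
#   wait_until 120 5 langfuse_is_up && echo "Langfuse is answering"
```

A 200 from the front page only shows the web container is serving; the dashboard check in the browser is still needed to confirm traces load from ClickHouse.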
Step 6 — Monitor
Watch for 15 minutes after recovery:
watch -n 5 'docker stats --no-stream | grep clickhouse'
If CPU climbs again within 15 minutes, the config fix isn't holding. Escalate and consider a forward fix (e.g. also disabling query_log, or shortening retention).
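To make "CPU climbs again" less of a judgment call, collected samples can be checked for a sustained run above a threshold. This is a sketch with assumed parameters: sustained_high is a hypothetical helper, and the choice of 100% for 12 consecutive samples (one minute at a 5-second interval) is illustrative, not a value from the incident.

```shell
# sustained_high <threshold> <run_length>
# Reads CPU percentages on stdin, one per line (a trailing % is allowed).
# Succeeds if <run_length> consecutive samples exceed <threshold>.
sustained_high() {
  awk -v t="$1" -v need="$2" '
    {
      v = $1
      sub(/%$/, "", v)
      if (v + 0 > t) run++; else run = 0
      if (run >= need) { found = 1; exit }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage on the host (not executed here): sample every 5 s for 15 minutes,
# then test for one minute of sustained CPU above 100%.
#   for i in $(seq 1 180); do
#     docker stats --no-stream --format '{{.CPUPerc}}' huph-clickhouse
#     sleep 5
#   done | sustained_high 100 12 && echo "CPU is climbing again: escalate"
```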
If config is missing and can't be restored
Apply the fix manually:
- Create /opt/huph/docker/clickhouse/config.d/logs.xml:
<clickhouse>
    <!-- Server-wide memory cap in bytes (2 GB); keep it below the container limit -->
    <max_server_memory_usage>2000000000</max_server_memory_usage>
    <!-- Noisy internal logs: disabled -->
    <metric_log remove="1"/>
    <query_thread_log remove="1"/>
    <opentelemetry_span_log remove="1"/>
    <asynchronous_metric_log remove="1"/>
    <!-- Kept internal logs: 7-day TTL to cap disk growth -->
    <query_log>
        <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
    </query_log>
    <trace_log>
        <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
    </trace_log>
    <text_log>
        <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
    </text_log>
</clickhouse>
- Make sure docker-compose.langfuse.yml mounts the config directory:
volumes:
  - ./docker/clickhouse/config.d:/etc/clickhouse-server/config.d
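- Restart ClickHouse. If you had to add the volume mount to the compose file, run docker-compose -f docker-compose.langfuse.yml up -d clickhouse instead, because restart does not apply compose file changes: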
docker-compose -f docker-compose.langfuse.yml restart clickhouse
- Verify the config was picked up:
docker exec huph-clickhouse ls /etc/clickhouse-server/config.d
What NOT to do
- Don't docker-compose down and up the Langfuse stack. It may wipe the ClickHouse data volume if volume configuration is off. Use restart or targeted stop/up.
- Don't TRUNCATE anything without confirming the system is not compromised.
- Don't raise the container memory limit above 4 GB without also raising the internal max_server_memory_usage in the XML config. The two must be kept in sync, with the internal cap below the container limit: a cap left at 2 GB stops ClickHouse from using the extra memory, and a cap at or above the container limit lets the container get OOM-killed anyway.
- Don't disable more log tables than the 4 listed above. Some are needed for Langfuse's own functionality.
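The relationship between the two memory settings can be checked mechanically before a change is rolled out. A minimal sketch: xml_memory_cap_bytes and cap_below_limit are hypothetical helpers, and the container limit has to be supplied in bytes by the caller because the exact value in docker-compose.langfuse.yml is not shown in this runbook.

```shell
# xml_memory_cap_bytes <xml_file>
# Prints the max_server_memory_usage value (bytes) from a ClickHouse
# config file, or nothing if the setting is absent.
xml_memory_cap_bytes() {
  sed -n 's:.*<max_server_memory_usage>\([0-9][0-9]*\)</max_server_memory_usage>.*:\1:p' "$1" \
    | head -n 1
}

# cap_below_limit <xml_file> <container_limit_bytes>
# Succeeds only when the internal cap is set and strictly below the
# container limit, leaving headroom for the rest of the process.
cap_below_limit() {
  cap=$(xml_memory_cap_bytes "$1")
  [ -n "$cap" ] && [ "$cap" -lt "$2" ]
}

# Usage on the host (not executed here; 3221225472 assumes the compose
# limit is written as 3g, which Docker reads as 3 GiB):
#   cap_below_limit /opt/huph/docker/clickhouse/config.d/logs.xml 3221225472 \
#     || echo "internal cap is missing or not below the container limit"
```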
Gotchas
- docker restart vs docker-compose restart. Prefer the docker-compose variants, which address the service by its compose name. Neither form of restart applies changes made to the compose file (environment, volumes, memory limits); those need docker-compose up -d <service> to recreate the container.
- Langfuse stack has 8+ containers. Only restart the one that is broken. Restarting the whole stack is a bigger hammer than needed.
- Data is in a docker volume, not a bind mount. The volume persists across restart. If you ever docker-compose down -v, you lose everything: -v deletes volumes. Don't.
- The fix is XML, not an environment variable. ClickHouse config is XML by design. If you're expecting a CLI flag or env var, you're going to be frustrated.
See also
- Incident playbook
- Debugging — observability stack
- Memory record: project_langfuse_clickhouse_bug.md