RCA Example
A recent and real root cause analysis that Foam produced for the Foam team itself.
In this example, our CEO (also an IC on this issue) was pinged directly in Slack with:
Foam
@Perla Gámez The background worker is out of disk space. Months of accumulated worktrees and lockfiles were never cleaned up.
Here’s exactly what Foam showed her:
```
ENOSPC: no space left on device, write
```
Solution
TL;DR
A background worker ran out of disk space because temporary files (git worktrees and lockfiles) accumulated without cleanup. The cleanup system existed in the codebase but was never enabled in production — a single missing environment variable left the entire maintenance subsystem disabled.
Root Cause
The trace for the failing job tells the whole story: every retry dies at the same fs.open() call, and the entire sequence is over in less than 1.5 seconds:
```
trace 4a8f2c · background-worker
├─ JobRunner.execute                   1.2ms  ERROR
│  ├─ WorkerPool.dispatch              0.8ms  ERROR
│  │  └─ fs.open()                     0.3ms  ENOSPC
│  └─ [retry 1] WorkerPool.dispatch    0.9ms  ERROR
│     └─ fs.open()                     0.2ms  ENOSPC
└─ [retry 2] JobRunner.execute         1.1ms  ERROR
   └─ WorkerPool.dispatch              0.7ms  ERROR
      └─ fs.open()                     0.2ms  ENOSPC
```
All 6 attempts (3 retries × 2 queue attempts) failed instantly. No actual work was ever attempted — the volume was already full before the job started. The logs confirm this:
```
[12-24 20:45:47] ERROR worker/job-runner  ENOSPC: no space left on device, write
[12-24 20:45:47] ERROR worker/job-runner  job=fc8291 failed at init, no work attempted
[12-24 20:45:48] WARN  worker/queue       job=fc8291 exhausted retries (6/6), marking failed
[12-24 20:45:48] WARN  worker/queue       job=a1e803 exhausted retries (6/6), marking failed
[12-24 20:45:49] WARN  worker/queue       3 more jobs queued, all expected to fail
```
So the question is: why is the disk full? Every job writes temporary files — lockfiles and git worktrees — to a shared EBS mount at /mnt/ebs. These files persist across container restarts. Per-job cleanup intentionally skips worktrees (they're kept for 1 hour to allow reuse), so only the scheduled maintenance system is supposed to prune them.
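To make that division of labor concrete, here is a minimal sketch of the per-job cleanup policy just described (the function name, directory layout, and scratch paths are assumptions for illustration, not the actual worker code):

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Hypothetical per-job cleanup sketching the policy described above.
async function cleanupAfterJob(jobId: string): Promise<void> {
  // Job-scoped scratch files are safe to delete immediately...
  await fs.rm(path.join("/mnt/ebs/tmp", jobId), { recursive: true, force: true });

  // ...but the worktree is deliberately left in place so a follow-up job on
  // the same repo can reuse it within the 1-hour window. Pruning expired
  // worktrees is delegated entirely to the scheduled maintenance system.
}
```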
But searching the last 30 days of logs for performMaintenance or cleanupOldWorktrees returns zero results. The maintenance system has never run. The metrics confirm the accumulation:
```
disk.usage{mount="/mnt/ebs"}    7d ago: 42% · 3d ago: 71% · 1d ago: 94% · now: 100%
worker.lockfile.count           11,847 files
worker.worktree.count           2,306 dirs
worker.worktree.cleanup.count   0 (last 30d)
```
Disk usage has been climbing linearly. The worktree count tracks with completed job count — nothing is being removed. Meanwhile, crashed workers orphan their lockfiles because lock cleanup runs in finally blocks, but OOM kills skip it entirely. The stale-lock TTL only fires when another job contends for the same lock, so the 11,847 lockfiles just accumulated silently.
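The orphaning mechanism is easy to see in a sketch. This is not the actual worker code (withLock and the lock-file layout are assumptions), but it shows why finally-based cleanup plus a contention-only TTL leaves a SIGKILLed worker's lock on disk indefinitely:

```typescript
import { promises as fs } from "node:fs";

const LOCK_TTL_MS = 5 * 60 * 1000; // the 5-minute stale-lock TTL from the report

// Hypothetical lock helper illustrating the failure mode described above.
async function withLock(lockPath: string, work: () => Promise<void>): Promise<void> {
  try {
    // "wx" fails with EEXIST if another job already holds the lock.
    await fs.writeFile(lockPath, String(process.pid), { flag: "wx" });
  } catch (err: any) {
    if (err.code !== "EEXIST") throw err;
    // The stale-lock check only runs here, on contention. An orphaned lock
    // that no later job ever contends for is never inspected, never removed.
    const { mtimeMs } = await fs.stat(lockPath);
    if (Date.now() - mtimeMs < LOCK_TTL_MS) throw err; // still fresh: give up
    await fs.rm(lockPath, { force: true });            // stale: steal it
    await fs.writeFile(lockPath, String(process.pid), { flag: "wx" });
  }
  try {
    await work();
  } finally {
    // Runs on success and on thrown errors, but an OOM kill (SIGKILL) ends
    // the process before this block executes, leaving the lockfile behind.
    await fs.rm(lockPath, { force: true });
  }
}
```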
The root cause: cleanup is gated behind MAINTENANCE_WORKER_ENABLED, which is absent from the production ECS task definition. The feature was built but never turned on. Once the volume hit 100%, failures became self-perpetuating — even creating a lockfile throws ENOSPC, so every retry fails before it starts.
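A plausible shape for that gate, sketched for illustration (MAINTENANCE_WORKER_ENABLED and performMaintenance appear in the incident; the scheduler scaffolding around them is assumed):

```typescript
declare function performMaintenance(): Promise<void>; // the real cleanup entry point

// Hypothetical scheduler bootstrap: an unset env var fails the check exactly
// like an explicit "false", which is how one missing ECS environment entry
// left the whole subsystem dormant.
export function startMaintenanceWorker(): void {
  if (process.env.MAINTENANCE_WORKER_ENABLED !== "true") {
    console.warn("maintenance worker disabled; skipping scheduled cleanup");
    return;
  }
  setInterval(() => {
    performMaintenance().catch((err) => console.error("maintenance failed:", err));
  }, 60 * 60 * 1000); // assumed hourly cadence
}
```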
Fix
1. Immediate — free disk space now
Exec into the worker container and purge stale artifacts manually:
```sh
# Identify the affected ECS task
aws ecs list-tasks --cluster prod-workers --service-name background-worker

# Exec into the container
aws ecs execute-command --cluster prod-workers \
  --task <task-id> --container worker \
  --interactive --command "/bin/sh"

# Inside the container: remove orphaned worktrees and lockfiles
# (-mindepth 1 keeps find from matching the worktrees directory itself)
find /mnt/ebs/worktrees -mindepth 1 -maxdepth 1 -mmin +60 -exec rm -rf {} +
find /mnt/ebs/locks -name "*.lock" -mmin +5 -delete
df -h /mnt/ebs   # confirm space recovered
```
2. Root fix — enable the maintenance subsystem
Add the missing environment variable to the ECS task definition:
```sh
# Update the task definition to include MAINTENANCE_WORKER_ENABLED
aws ecs describe-task-definition --task-definition background-worker \
  --query 'taskDefinition.containerDefinitions' > /tmp/containers.json

# Add to the environment array in containers.json:
#   { "name": "MAINTENANCE_WORKER_ENABLED", "value": "true" }

aws ecs register-task-definition \
  --family background-worker \
  --container-definitions file:///tmp/containers.json \
  --task-role-arn arn:aws:iam::123456789:role/worker-task-role \
  --execution-role-arn arn:aws:iam::123456789:role/ecsTaskExecutionRole

aws ecs update-service --cluster prod-workers \
  --service background-worker \
  --task-definition background-worker --force-new-deployment
```
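Once the deployment settles, a quick sanity check (a hypothetical follow-up, not part of Foam's report) confirms the service picked up the new revision:

```sh
# Confirm the running deployment uses the new task definition revision
aws ecs describe-services --cluster prod-workers \
  --services background-worker \
  --query 'services[0].deployments[0].taskDefinition'
```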
With the flag set, the existing cleanup logic takes over (worktree pruning at a 24h TTL, lockfile cleanup at a 5min TTL, scheduled git GC). At ~600 jobs/day, steady-state disk usage should stabilize around 8–12 GB, well within the 43 GB volume.
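For reference, a maintenance pass implementing those TTLs might look like the sketch below; the directory layout, helper names, and repo path are assumptions. The implied arithmetic checks out: a 24h TTL at ~600 jobs/day keeps on the order of 600 worktrees on disk at any moment, so 8–12 GB implies a mean worktree footprint somewhere around 15 MB.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

const WORKTREE_TTL_MS = 24 * 60 * 60 * 1000; // 24h worktree TTL from the report
const LOCK_TTL_MS = 5 * 60 * 1000;           // 5min lockfile TTL from the report

// Delete every entry in dir whose mtime is older than ttlMs.
async function pruneDir(dir: string, ttlMs: number): Promise<void> {
  const now = Date.now();
  for (const entry of await fs.readdir(dir)) {
    const p = path.join(dir, entry);
    const { mtimeMs } = await fs.stat(p);
    if (now - mtimeMs > ttlMs) {
      await fs.rm(p, { recursive: true, force: true });
    }
  }
}

// Hypothetical maintenance pass wiring the TTLs together.
async function performMaintenance(): Promise<void> {
  await pruneDir("/mnt/ebs/worktrees", WORKTREE_TTL_MS);
  await pruneDir("/mnt/ebs/locks", LOCK_TTL_MS);
  // Repack loose objects so the long-lived base clone stays compact
  // ("/mnt/ebs/repo" is an assumed path).
  await execFileAsync("git", ["-C", "/mnt/ebs/repo", "gc", "--auto"]);
}
```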