Demo
See a recent report delivered to our CEO (also an IC in our team).
Foam
@Perla Gámez The background worker is out of disk space. Months of accumulated worktrees and lockfiles were never cleaned up.
Error ENOSPC Device
Solution
TL;DR
A background worker ran out of disk space because temporary files (git worktrees and lockfiles) accumulated without cleanup. The cleanup system existed in the codebase but was never enabled in production — a single missing environment variable left the entire maintenance subsystem disabled.
Root Cause
The trace for the failing job tells the whole story. Every attempt dies at the same fs.open() call, and none lasts more than 1.5 milliseconds:
trace 4a8f2c · background-worker
├─ JobRunner.execute 1.2ms ERROR
│  ├─ WorkerPool.dispatch 0.8ms ERROR
│  │  └─ fs.open() 0.3ms ENOSPC
│  └─ [retry 1] WorkerPool.dispatch 0.9ms ERROR
│     └─ fs.open() 0.2ms ENOSPC
└─ [retry 2] JobRunner.execute 1.1ms ERROR
   └─ WorkerPool.dispatch 0.7ms ERROR
      └─ fs.open() 0.2ms ENOSPC
All 6 attempts (3 retries × 2 queue attempts) failed instantly. No actual work was ever attempted — the volume was already full before the job started. The logs confirm this:
[12-24 20:45:47] ERROR worker/job-runner ENOSPC: no space left on device, write
[12-24 20:45:47] ERROR worker/job-runner job=fc8291 failed at init
[12-24 20:45:48] WARN worker/queue job=fc8291 exhausted retries
[12-24 20:45:49] WARN worker/queue 3 more jobs queued, all expected to fail
Searching the logs for performMaintenance or cleanupOldWorktrees over the last 30 days returns zero results. The maintenance system has not run once in that window, and the metrics show the accumulation:
disk.usage{mount="/mnt/ebs"} 7d ago: 42% · 3d ago: 71% · 1d ago: 94% · now: 100%
worker.lockfile.count 11,847 files
worker.worktree.count 2,306 dirs
worker.worktree.cleanup.count 0 (last 30d)
The root cause: cleanup is gated behind MAINTENANCE_WORKER_ENABLED, which is absent from the production ECS task definition. Once the volume hit 100%, failures became self-perpetuating because even creating a lockfile throws ENOSPC.
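To make the failure mode concrete, here is a minimal shell sketch of how a gate like this behaves when the variable is absent. The function names are hypothetical, and the assumption that only the literal string `true` enables maintenance is illustrative; the actual check in the codebase may differ.

```shell
#!/usr/bin/env bash
# Hypothetical gate: anything other than the literal "true" counts as disabled.
# Assumption: the real worker may accept other values or parse this differently.
maintenance_enabled() {
  [ "${MAINTENANCE_WORKER_ENABLED:-}" = "true" ]
}

run_maintenance_if_enabled() {
  if maintenance_enabled; then
    echo "maintenance: enabled"
  else
    # An unset variable lands here without any error, so cleanup never runs.
    echo "maintenance: disabled"
  fi
}
```

With the variable unset, `run_maintenance_if_enabled` takes the disabled branch and raises no error, which would explain why nothing flagged the gap until the disk filled.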
Fix
1. Immediate — free disk space now
Purge stale artifacts manually:
# -mindepth 1 keeps find from matching (and deleting) /mnt/ebs/worktrees itself
find /mnt/ebs/worktrees -mindepth 1 -maxdepth 1 -mmin +60 -exec rm -rf {} +
find /mnt/ebs/locks -name "*.lock" -mmin +5 -delete
df -h /mnt/ebs
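Because the purge is destructive, it may help to preview what would be removed before running it. The sketch below reuses the same `find` age filter but only lists and counts matches. The helper names `list_stale` and `count_stale` are hypothetical, and the count assumes entry names contain no newlines.

```shell
#!/usr/bin/env bash
# List top-level entries under DIR whose mtime is more than MINUTES minutes old.
# -mindepth 1 excludes DIR itself from the results.
list_stale() {
  local dir="$1" minutes="$2"
  find "$dir" -mindepth 1 -maxdepth 1 -mmin "+${minutes}" -print
}

# Count the entries list_stale would return (one line per entry).
count_stale() {
  list_stale "$1" "$2" | wc -l | tr -d ' '
}
```

For example, `count_stale /mnt/ebs/worktrees 60` reports how many worktree directories the manual purge would delete, without touching any of them.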
2. Root fix — enable the maintenance subsystem
Add the missing environment variable and redeploy the worker:
MAINTENANCE_WORKER_ENABLED=true
aws ecs update-service --cluster prod-workers --service background-worker --force-new-deployment
This turns on existing cleanup for worktrees, lockfiles, and git GC.
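After the redeploy, it may be worth confirming that usage on the volume actually stays down. The following is a minimal sketch of a threshold check built on `df -P`. The function names and the threshold value are assumptions, not part of the existing worker.

```shell
#!/usr/bin/env bash
# Print the used-space percentage (digits only) for the filesystem holding PATH.
# df -P forces the POSIX one-line-per-filesystem format; column 5 is capacity, e.g. "42%".
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Return non-zero when usage is at or above THRESHOLD percent.
check_disk_below() {
  local path="$1" threshold="$2" pct
  pct="$(disk_usage_pct "$path")"
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN: $path at ${pct}% (threshold ${threshold}%)" >&2
    return 1
  fi
  echo "OK: $path at ${pct}%"
}
```

Running something like `check_disk_below /mnt/ebs 80` on a schedule would catch a repeat of the 42% to 100% climb well before jobs start failing.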