Demo

See a recent report delivered to our CEO (also an IC on our team).

FoamApp · 10:43 AM

@Perla Gámez The background worker is out of disk space. Months of accumulated worktrees and lockfiles were never cleaned up.

ENOSPC · Unhandled · background-worker · 2h ago

Error: ENOSPC: no space left on device

Stacktrace

at JobRunner.execute (job-runner.ts:142)
at WorkerPool.dispatch (worker-pool.ts:87)
at Object.open (node:fs/promises)
at FSReqCallback.oncomplete (node:fs)

Solution

TL;DR

A background worker ran out of disk space because temporary files (git worktrees and lockfiles) accumulated without cleanup. The cleanup system existed in the codebase but was never enabled in production — a single missing environment variable left the entire maintenance subsystem disabled.


Root Cause

The trace for the failing job tells the whole story. Every retry dies at the same fs.open() call, and none of them lasts longer than a few milliseconds:

trace 4a8f2c · background-worker
├─ JobRunner.execute                 1.2ms  ERROR
│  ├─ WorkerPool.dispatch            0.8ms  ERROR
│  │  └─ fs.open()                   0.3ms  ENOSPC
│  └─ [retry 1] WorkerPool.dispatch  0.9ms  ERROR
│     └─ fs.open()                   0.2ms  ENOSPC
└─ [retry 2] JobRunner.execute       1.1ms  ERROR
   └─ WorkerPool.dispatch            0.7ms  ERROR
      └─ fs.open()                   0.2ms  ENOSPC

All 6 attempts (3 tries per queue attempt × 2 queue attempts) failed instantly. No actual work was ever attempted; the volume was already full before the job started. The logs confirm this, and a sketch of the retry layering follows the excerpt:

[12-24 20:45:47] ERROR  worker/job-runner  ENOSPC: no space left on device, write
[12-24 20:45:47] ERROR  worker/job-runner  job=fc8291 failed at init
[12-24 20:45:48] WARN   worker/queue       job=fc8291 exhausted retries
[12-24 20:45:49] WARN   worker/queue       3 more jobs queued, all expected to fail
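
The report doesn't include the retry code itself, so here is one plausible layering that yields six attempts. withTries, runJobWithQueueRetries, and both constants are illustrative assumptions, not the actual JobRunner/WorkerPool API; only the 3-try / 2-queue-attempt shape comes from the trace above.

// Hypothetical sketch: two retry layers that multiply into 6 attempts.
async function withTries<T>(fn: () => Promise<T>, tries: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < tries; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // ENOSPC lands here on every try
    }
  }
  throw lastError;
}

async function runJobWithQueueRetries(job: () => Promise<void>): Promise<void> {
  const QUEUE_ATTEMPTS = 2;    // queue re-enqueues a failed job once
  const TRIES_PER_ATTEMPT = 3; // initial try + 2 in-process retries
  for (let attempt = 1; attempt <= QUEUE_ATTEMPTS; attempt++) {
    try {
      return await withTries(job, TRIES_PER_ATTEMPT);
    } catch (err) {
      if (attempt === QUEUE_ATTEMPTS) throw err; // "exhausted retries" in the log
    }
  }
}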

Searching the logs for performMaintenance or cleanupOldWorktrees over the last 30 days returns zero results. The maintenance system has never run, and the metrics show the accumulation:

disk.usage{mount="/mnt/ebs"}
  7d ago: 42%  ·  3d ago: 71%  ·  1d ago: 94%  ·  now: 100%

worker.lockfile.count              11,847 files
worker.worktree.count               2,306 dirs
worker.worktree.cleanup.count       0 (last 30d)

The root cause: cleanup is gated behind MAINTENANCE_WORKER_ENABLED, which is absent from the production ECS task definition. Once the volume hit 100%, failures became self-perpetuating because even creating a lockfile throws ENOSPC.
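
This failure mode is easy to ship: an unset variable reads as undefined, the strict string comparison fails, and the subsystem no-ops without logging anything. Below is a minimal sketch of the gating pattern, assuming a hypothetical startMaintenanceWorker entry point; performMaintenance and cleanupOldWorktrees are the names from the search above, and everything else is illustrative.

// Sketch of the gate, not the worker's actual code. With the variable absent,
// process.env.MAINTENANCE_WORKER_ENABLED is undefined, the guard returns early,
// and cleanup is never scheduled; nothing in the logs hints it was skipped.
export function startMaintenanceWorker(): void {
  if (process.env.MAINTENANCE_WORKER_ENABLED !== "true") {
    return; // silent no-op
  }
  setInterval(() => void performMaintenance(), 60 * 60 * 1000); // e.g. hourly
}

async function performMaintenance(): Promise<void> {
  await cleanupOldWorktrees();
  // lockfile sweep and git gc would follow here
}

async function cleanupOldWorktrees(): Promise<void> {
  // stub for this sketch; see the worktree cleanup example at the end
}

A more defensive version would log whichever branch it takes at startup, so a disabled maintenance loop shows up as one log line instead of 30 days of silence.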


Fix

1. Immediate — free disk space now

Purge stale artifacts manually:

# Remove top-level worktrees untouched for over an hour; -mindepth 1 protects the root dir
find /mnt/ebs/worktrees -mindepth 1 -maxdepth 1 -mmin +60 -exec rm -rf {} +
# Delete lockfiles older than 5 minutes
find /mnt/ebs/locks -name "*.lock" -mmin +5 -delete
# Verify space was reclaimed
df -h /mnt/ebs

2. Root fix — enable the maintenance subsystem

Add the missing environment variable and redeploy the worker:

# Add to the background-worker ECS task definition:
MAINTENANCE_WORKER_ENABLED=true
# Then redeploy the worker service:
aws ecs update-service --cluster prod-workers --service background-worker --force-new-deployment

This turns on the existing cleanup passes for worktrees, lockfiles, and git GC.
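
For context, here is a sketch of what a worktree cleanup pass like this might look like, mirroring the 60-minute retention of the manual find command above. The function name comes from the log search earlier; the path constant, retention window, and body are assumptions.

import { readdir, stat, rm } from "node:fs/promises";
import { join } from "node:path";

// Sketch only: the programmatic equivalent of the manual find command above.
const WORKTREE_ROOT = "/mnt/ebs/worktrees";
const MAX_AGE_MS = 60 * 60 * 1000; // same 60-minute window as the find command

export async function cleanupOldWorktrees(): Promise<void> {
  const now = Date.now();
  for (const name of await readdir(WORKTREE_ROOT)) {
    const path = join(WORKTREE_ROOT, name);
    const { mtimeMs } = await stat(path);
    if (now - mtimeMs > MAX_AGE_MS) {
      await rm(path, { recursive: true, force: true }); // like rm -rf
    }
  }
}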