RCA Example

A recent and real root cause analysis Foam produced for the Foam team.

In this example, our CEO (also an IC on this issue) was pinged directly in Slack with:

foamApp  10:43 AM

@Perla Gámez The background worker is out of disk space. Months of accumulated worktrees and lockfiles were never cleaned up.

Here’s exactly what Foam showed her:

ENOSPC · Unhandled · background-worker · 2h ago

ENOSPC: no space left on device, write

SpanID: a1b2c3d4e5f6
TraceID: 9f8e7d6c5b4a3210

Stacktrace

at JobRunner.execute (job-runner.ts:142)
at WorkerPool.dispatch (worker-pool.ts:87)
at Object.open (node:fs/promises)
at FSReqCallback.oncomplete (node:fs)

Solution

TL;DR

A background worker ran out of disk space because temporary files (git worktrees and lockfiles) accumulated without cleanup. The cleanup system existed in the codebase but was never enabled in production — a single missing environment variable left the entire maintenance subsystem disabled.


Root Cause

The trace for the failing job tells the whole story — every retry dies at the same fs.open() call, and none of them last more than 1.5 seconds:

trace 4a8f2c · background-worker
├─ JobRunner.execute                 1.2ms  ERROR
│  ├─ WorkerPool.dispatch            0.8ms  ERROR
│  │  └─ fs.open()                   0.3ms  ENOSPC
│  └─ [retry 1] WorkerPool.dispatch  0.9ms  ERROR
│     └─ fs.open()                   0.2ms  ENOSPC
└─ [retry 2] JobRunner.execute       1.1ms  ERROR
   └─ WorkerPool.dispatch            0.7ms  ERROR
      └─ fs.open()                   0.2ms  ENOSPC

All 6 attempts (3 retries × 2 queue attempts) failed instantly. No actual work was ever attempted — the volume was already full before the job started. The logs confirm this:

[12-24 20:45:47] ERROR  worker/job-runner  ENOSPC: no space left on device, write
[12-24 20:45:47] ERROR  worker/job-runner  job=fc8291 failed at init, no work attempted
[12-24 20:45:48] WARN   worker/queue       job=fc8291 exhausted retries (6/6), marking failed
[12-24 20:45:48] WARN   worker/queue       job=a1e803 exhausted retries (6/6), marking failed
[12-24 20:45:49] WARN   worker/queue       3 more jobs queued, all expected to fail

So the question is: why is the disk full? Every job writes temporary files — lockfiles and git worktrees — to a shared EBS mount at /mnt/ebs. These files persist across container restarts. Per-job cleanup intentionally skips worktrees (they're kept for 1 hour to allow reuse), so only the scheduled maintenance system is supposed to prune them.

But searching the logs for performMaintenance or cleanupOldWorktrees over the last 30 days returns zero hits. The maintenance system has never run. The metrics confirm the accumulation:

disk.usage{mount="/mnt/ebs"}
  7d ago: 42%  ·  3d ago: 71%  ·  1d ago: 94%  ·  now: 100%

worker.lockfile.count              11,847 files
worker.worktree.count               2,306 dirs
worker.worktree.cleanup.count       0 (last 30d)

Disk usage has been climbing linearly. The worktree count tracks with completed job count — nothing is being removed. Meanwhile, crashed workers orphan their lockfiles because lock cleanup runs in finally blocks, but OOM kills skip it entirely. The stale-lock TTL only fires when another job contends for the same lock, so the 11,847 lockfiles just accumulated silently.

The root cause: cleanup is gated behind MAINTENANCE_WORKER_ENABLED, which is absent from the production ECS task definition. The feature was built but never turned on. Once the volume hit 100%, failures became self-perpetuating — even creating a lockfile throws ENOSPC, so every retry fails before it starts.
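This failure mode is the classic env-var feature gate. A sketch of the pattern — the variable name is from the RCA, everything else is illustrative:

```typescript
// The entire maintenance subsystem is gated on one env var. An env var
// absent from the ECS task definition reads as undefined, so a missing
// entry silently disables every cleanup job below -- no error, no log.
function startMaintenanceWorker(env: Record<string, string | undefined>): boolean {
  if (env.MAINTENANCE_WORKER_ENABLED !== "true") {
    return false; // never scheduled: no worktree pruning, no lock cleanup
  }
  // scheduleWorktreePruning(); scheduleLockfileCleanup(); scheduleGitGC();
  return true;
}
```

Called as `startMaintenanceWorker(process.env)`, the missing variable and an explicit `"false"` are indistinguishable — which is why nothing flagged the misconfiguration for 30+ days.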


Fix

1. Immediate — free disk space now

SSH into the worker host and purge stale artifacts manually:

# Identify the affected ECS task
aws ecs list-tasks --cluster prod-workers --service-name background-worker

# Exec into the container
aws ecs execute-command --cluster prod-workers \
  --task <task-id> --container worker \
  --interactive --command "/bin/sh"

# Inside the container: remove orphaned worktrees and lockfiles
find /mnt/ebs/worktrees -maxdepth 1 -mmin +60 -exec rm -rf {} +
find /mnt/ebs/locks -name "*.lock" -mmin +5 -delete
df -h /mnt/ebs  # confirm space recovered

2. Root fix — enable the maintenance subsystem

Add the missing environment variable to the ECS task definition:

# Update the task definition to include MAINTENANCE_WORKER_ENABLED
aws ecs describe-task-definition --task-definition background-worker \
  --query 'taskDefinition.containerDefinitions' > /tmp/containers.json

# Add to the environment array in containers.json:
# { "name": "MAINTENANCE_WORKER_ENABLED", "value": "true" }

aws ecs register-task-definition \
  --family background-worker \
  --container-definitions file:///tmp/containers.json \
  --task-role-arn arn:aws:iam::123456789:role/worker-task-role \
  --execution-role-arn arn:aws:iam::123456789:role/ecsTaskExecutionRole

aws ecs update-service --cluster prod-workers \
  --service background-worker \
  --task-definition background-worker --force-new-deployment

This enables the existing cleanup logic (worktree pruning at 24h TTL, lockfile cleanup at 5min TTL, git GC). At ~600 jobs/day, steady-state disk usage should stabilize around 8–12 GB — well within the 43 GB volume.
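As a sanity check on that steady-state estimate: with a 24h pruning TTL, roughly one day's worth of worktrees is live at any time. The job rate and TTL are from this RCA; the per-worktree size range is an assumption chosen to match the stated 8–12 GB figure:

```typescript
// Back-of-the-envelope steady-state disk estimate.
const jobsPerDay = 600;
const worktreeTtlDays = 1;                          // 24h pruning TTL
const liveWorktrees = jobsPerDay * worktreeTtlDays; // ~600 live at once

const mbPerWorktreeLow = 14;   // assumed average worktree size (MB)
const mbPerWorktreeHigh = 20;  // assumed upper bound (MB)

const lowGB = (liveWorktrees * mbPerWorktreeLow) / 1024;   // ~8.2 GB
const highGB = (liveWorktrees * mbPerWorktreeHigh) / 1024; // ~11.7 GB
```

Lockfiles are negligible by comparison at a 5-minute TTL; the worktrees dominate, and either bound sits comfortably inside the 43 GB volume.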