Research

Findings we make public

May 11, 2026

Measuring which tool calls help agents debug

Marginal tool utility and tool efficiency measure whether individual tool calls improve an agent’s probability of solving the task. Removing noisy tools preserved accuracy while doubling efficiency.

Benchmark · Tool Efficiency
Marginal tool utility signs across default APEX-SWE Observability trajectories by GPT-5.3-Codex.
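
As a rough illustration of the two metrics named above, the sketch below assumes one possible reading: marginal tool utility is the change in an estimated solve probability across a single tool call, and tool efficiency is the share of calls whose utility is positive. These definitions, the function names, and the sample numbers are assumptions for illustration, not the benchmark's actual method or data.

```python
from typing import List


def marginal_utilities(solve_probs: List[float]) -> List[float]:
    """Change in estimated solve probability across each tool call.

    solve_probs[i] is the (hypothetical) estimate before call i and
    solve_probs[i + 1] the estimate after it, so a trajectory with
    n tool calls needs n + 1 estimates.
    """
    return [after - before for before, after in zip(solve_probs, solve_probs[1:])]


def tool_efficiency(utilities: List[float]) -> float:
    """Share of tool calls with strictly positive marginal utility.

    This is an assumed definition; an empty trajectory scores 0.0.
    """
    if not utilities:
        return 0.0
    helpful = sum(1 for u in utilities if u > 0)
    return helpful / len(utilities)


# Hypothetical trajectory: three tool calls, four probability estimates.
example_probs = [0.25, 0.5, 0.375, 0.875]
example_utils = marginal_utilities(example_probs)
example_signs = [1 if u > 0 else (-1 if u < 0 else 0) for u in example_utils]
example_efficiency = tool_efficiency(example_utils)
```

Under this reading, a call that lowers the estimate gets a negative sign, and dropping tools that mostly produce such calls would raise efficiency without necessarily changing final accuracy.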
April 4, 2026 · Benchmark

Measuring root cause accuracy from telemetry

A benchmark for the question every debugging agent should answer: what caused the production failure? Agents are evaluated on root cause analysis from telemetry, not on log summarization.

Benchmark · Root Cause Accuracy
Root cause accuracy: Cursor plus MCP at 41%, Cursor plus Foam MCP at 64%, Foam at 86%.