Measuring which tool calls help agents debug
Download the paper: the submitted PDF behind this note.
This is our NeurIPS submission on tool-efficient agents.
The paper introduces two measurements for agent trajectories. Marginal tool utility asks whether a specific tool call made the agent more likely to solve the task. Tool efficiency measures the share of tool calls that were useful.
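As a rough illustration of how the two measurements relate, here is a minimal sketch. It is not the paper's implementation. It assumes marginal tool utility can be approximated as the change in an estimated probability of solving the task across a single tool call, and that a call counts as "useful" when that change is positive. The `ToolCall` type, its fields, the `threshold` parameter, and the convention of returning 0.0 for an empty trajectory are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    """One tool call in a trajectory (hypothetical structure for this sketch)."""
    tool: str
    p_solve_before: float  # assumed estimate of P(task solved) before the call
    p_solve_after: float   # assumed estimate of P(task solved) after the call


def marginal_tool_utility(call: ToolCall) -> float:
    """Assumed definition: change in estimated solve probability from one call."""
    return call.p_solve_after - call.p_solve_before


def tool_efficiency(calls: List[ToolCall], threshold: float = 0.0) -> float:
    """Share of calls whose marginal utility exceeds the threshold.

    Returns 0.0 for an empty trajectory (a convention chosen for this sketch).
    """
    if not calls:
        return 0.0
    useful = sum(1 for c in calls if marginal_tool_utility(c) > threshold)
    return useful / len(calls)


# Illustrative trajectory with made-up probabilities, not data from the paper.
trajectory = [
    ToolCall("grafana_loki", 0.25, 0.50),  # helped
    ToolCall("mattermost", 0.50, 0.50),    # no change
    ToolCall("plane", 0.50, 0.25),         # hurt
    ToolCall("grafana_loki", 0.50, 0.75),  # helped
]
print(tool_efficiency(trajectory))  # 0.5
```

Under these assumptions, two of the four calls raise the estimated solve probability, so the trajectory's tool efficiency is 0.5. How the solve probabilities themselves are estimated is the hard part and is left to the paper.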
We evaluated these ideas on APEX-SWE Observability, where agents solve software debugging tasks with access to read-only MCP tools. The default harness included Grafana/Loki, Mattermost, and Plane. The ablated harness kept only Grafana/Loki.
The result was the pattern the metric was designed to reveal. Removing noisy tools preserved task accuracy while increasing tool efficiency.
For GPT-5.3-Codex, mean tool efficiency moved from 0.359 with the default suite to 0.720 with Grafana/Loki only. For Gemini 3.1 Pro, it moved from 0.367 to 0.593.
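To put those four numbers side by side, the short sketch below computes the absolute and relative change in mean tool efficiency for each model. It uses only the figures quoted above; the `results` dictionary and the `efficiency_change` helper are names introduced for this example.

```python
# Mean tool efficiency (default suite, Grafana/Loki only), as quoted in this note.
results = {
    "GPT-5.3-Codex": (0.359, 0.720),
    "Gemini 3.1 Pro": (0.367, 0.593),
}


def efficiency_change(default: float, ablated: float) -> tuple:
    """Return (absolute gain, ratio of ablated to default), rounded for display."""
    return round(ablated - default, 3), round(ablated / default, 2)


for model, (default, ablated) in results.items():
    gain, ratio = efficiency_change(default, ablated)
    print(f"{model}: +{gain} absolute, {ratio}x relative")
# GPT-5.3-Codex: +0.361 absolute, 2.01x relative
# Gemini 3.1 Pro: +0.226 absolute, 1.62x relative
```

In other words, mean tool efficiency roughly doubled for GPT-5.3-Codex and rose by about 60 percent for Gemini 3.1 Pro when the harness was reduced to Grafana/Loki only.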
The lesson is not that agents should always receive fewer tools. The lesson is that tool suites deserve measurement. Some tools add signal. Some add noise. Marginal tool utility gives us a way to separate the two, and tool efficiency gives us a number worth optimizing.