Measuring which tool calls help agents debug

Download paper, the submitted PDF behind this note.

This is our NeurIPS submission on tool-efficient agents.

The paper introduces two measurements for agent trajectories. Marginal tool utility asks whether a specific tool call made the agent more likely to solve the task. Tool efficiency measures the share of tool calls that were useful.
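To make the two definitions concrete, here is a minimal sketch rather than the paper's estimator, which this note does not spell out. It assumes that each tool call comes with an estimated solve probability before and after the call, and that a call counts as "useful" when its marginal utility is strictly positive. The names `ToolCall`, `p_before`, `p_after`, `marginal_tool_utility`, and `tool_efficiency` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    p_before: float  # assumed: estimated solve probability before the call
    p_after: float   # assumed: estimated solve probability after the call


def marginal_tool_utility(call: ToolCall) -> float:
    """Change in estimated solve probability attributed to one call."""
    return call.p_after - call.p_before


def utility_sign(call: ToolCall) -> str:
    """Bucket a call as positive, zero, or negative utility."""
    delta = marginal_tool_utility(call)
    if delta > 0:
        return "positive"
    if delta < 0:
        return "negative"
    return "zero"


def tool_efficiency(calls: list[ToolCall]) -> float:
    """Share of calls with strictly positive marginal utility (assumed rule)."""
    if not calls:
        return 0.0
    useful = sum(1 for c in calls if marginal_tool_utility(c) > 0)
    return useful / len(calls)


# Illustrative trajectory with made-up probabilities, not data from the paper.
trajectory = [
    ToolCall("grafana_loki", 0.2, 0.5),
    ToolCall("mattermost", 0.5, 0.5),
    ToolCall("plane", 0.5, 0.4),
    ToolCall("grafana_loki", 0.4, 0.9),
]

signs = Counter(utility_sign(c) for c in trajectory)
efficiency = tool_efficiency(trajectory)
print(dict(signs), efficiency)
```

In this toy trajectory two of the four calls raise the estimated solve probability, so the efficiency is 0.5. How the probabilities themselves are estimated is the hard part, and the sketch deliberately leaves it out.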

We evaluated these ideas on APEX-SWE Observability, where agents solve software debugging tasks with access to read-only MCP tools. The default harness included Grafana/Loki, Mattermost, and Plane. The ablated harness kept only Grafana/Loki.

The result was the pattern the metric was designed to reveal. Removing noisy tools preserved task accuracy while increasing tool efficiency.

For GPT-5.3-Codex, mean tool efficiency moved from 0.359 with the default suite to 0.720 with Grafana/Loki only. For Gemini 3.1 Pro, it moved from 0.367 to 0.593.
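For readers who want the size of the change, the short calculation below derives absolute and relative gains from the four reported means. It uses only the numbers quoted above. The `gain` helper and the `reported` dictionary are illustrative names, not something from the paper.

```python
# Mean tool efficiency as reported above: (default suite, Grafana/Loki only).
reported = {
    "GPT-5.3-Codex": (0.359, 0.720),
    "Gemini 3.1 Pro": (0.367, 0.593),
}


def gain(default: float, ablated: float) -> tuple[float, float]:
    """Return (absolute gain, ratio of ablated to default)."""
    return ablated - default, ablated / default


for model, (default, ablated) in reported.items():
    absolute, ratio = gain(default, ablated)
    print(f"{model}: +{absolute:.3f} absolute, {ratio:.2f}x relative")
# GPT-5.3-Codex: +0.361 absolute, 2.01x relative
# Gemini 3.1 Pro: +0.226 absolute, 1.62x relative
```

By this arithmetic, mean tool efficiency roughly doubled for GPT-5.3-Codex and rose by a little over 60% for Gemini 3.1 Pro.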

Figure: bar chart of marginal tool utility signs across all default APEX-SWE Observability trajectories by GPT-5.3-Codex.

The lesson is not that agents should always receive fewer tools. The lesson is that tool suites deserve measurement. Some tools add signal. Some add noise. Marginal tool utility gives us a way to separate the two, and tool efficiency gives us a number worth optimizing.

Foam Research