Marginal tool utility in agentic debugging
The paper introduces two measurements for agent trajectories. Marginal tool utility asks whether a specific tool call made the agent more likely to solve the task. Tool efficiency measures the share of tool calls that were useful.
We evaluated these ideas on APEX-SWE Observability, where agents solve software debugging tasks with access to read-only MCP tools. The default harness included Grafana/Loki, Mattermost, and Plane. The ablated harness kept only Grafana/Loki.
The result was the pattern the metric was designed to reveal. Removing noisy tools preserved task accuracy while increasing tool efficiency.
For GPT-5.3-Codex, mean tool efficiency moved from 0.359 with the default suite to 0.720 with Grafana/Loki only. For Gemini 3.1 Pro, it moved from 0.367 to 0.593.
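To put the reported means side by side, the short sketch below computes the absolute gain and the ratio between the two harnesses. The four numbers come from the paragraph above; the function name `efficiency_change` and the dictionary layout are illustrative choices only.

```python
# Mean tool efficiency values reported above: (default suite, Grafana/Loki only).
reported = {
    "GPT-5.3-Codex": (0.359, 0.720),
    "Gemini 3.1 Pro": (0.367, 0.593),
}


def efficiency_change(default: float, ablated: float) -> tuple[float, float]:
    """Return (absolute gain, ratio of ablated to default efficiency)."""
    return ablated - default, ablated / default


for model, (default, ablated) in reported.items():
    gain, ratio = efficiency_change(default, ablated)
    print(f"{model}: +{gain:.3f} absolute, {ratio:.2f}x the default-suite value")
```

Under these figures, mean tool efficiency roughly doubles for GPT-5.3-Codex (about 2.01x) and rises by about 1.62x for Gemini 3.1 Pro.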
Definitions
Marginal Tool Utility
Does a specific tool call increase the agent’s probability of solving the task? Positive, negative, or zero per call.
Tool Efficiency
The share of tool calls in a trajectory that carried positive marginal utility. A number worth optimizing.
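The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming a per-call marginal utility estimate is already available for every tool call in a trajectory; how that estimate is obtained is not covered here. The function names, the `tol` parameter, and the example values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mtu_sign(mtu: float, tol: float = 0.0) -> int:
    """Classify one call's marginal utility as +1, -1, or 0.

    `tol` is a hypothetical dead band: estimates within [-tol, tol] count as zero.
    """
    if mtu > tol:
        return 1
    if mtu < -tol:
        return -1
    return 0


def tool_efficiency(mtu_values: Sequence[float], tol: float = 0.0) -> float:
    """Share of tool calls in one trajectory with positive marginal utility."""
    if not mtu_values:
        raise ValueError("trajectory contains no tool calls")
    positive = sum(1 for value in mtu_values if mtu_sign(value, tol) > 0)
    return positive / len(mtu_values)


# Hypothetical per-call estimates for one five-call trajectory (not real data).
example_trajectory = [0.10, -0.05, 0.0, 0.20, -0.02]
print(tool_efficiency(example_trajectory))  # 2 of 5 calls are positive -> 0.4
```

Calls with zero or negative utility both count against efficiency in this sketch, which matches the definition as "the share of tool calls that carried positive marginal utility".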
Figure: Tool efficiency before and after removing noisy tools (panels: GPT-5.3-Codex, Gemini 3.1 Pro).
Figure: MTU signs by tool, GPT-5.3-Codex, default APEX-SWE trajectories.
Mattermost and Plane produce predominantly negative marginal tool utility (MTU): calls to these tools make the agent less likely to solve the task.
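A per-tool breakdown like this one can be produced by tallying the sign of each call's MTU by tool. The sketch below shows one way to do that, assuming each call is logged as a `(tool_name, mtu)` pair. The function name, the tool identifiers, and every number in the example are invented placeholders for illustration, not results from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def signs_by_tool(calls: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, int]]:
    """Count positive, negative, and zero MTU calls for each tool."""
    tally: Dict[str, Counter] = defaultdict(Counter)
    for tool, mtu in calls:
        if mtu > 0:
            label = "positive"
        elif mtu < 0:
            label = "negative"
        else:
            label = "zero"
        tally[tool][label] += 1
    return {tool: dict(counts) for tool, counts in tally.items()}


# Placeholder call log (hypothetical tool identifiers and values, not real data).
calls = [
    ("tool_a", 0.12),
    ("tool_a", 0.00),
    ("tool_b", -0.04),
    ("tool_b", -0.01),
    ("tool_b", 0.03),
]
print(signs_by_tool(calls))
```

A tool whose tally is dominated by negative counts is a candidate for removal from the harness, which is the kind of ablation described above.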
The lesson is not that agents should always receive fewer tools. The lesson is that tool suites deserve measurement. Some tools add signal. Some add noise. Marginal tool utility gives us a way to separate the two, and tool efficiency gives us a number worth optimizing.