Marginal tool utility in agentic debugging

The paper introduces two measurements for agent trajectories. Marginal tool utility asks whether a specific tool call made the agent more likely to solve the task. Tool efficiency measures the share of tool calls that were useful.

We evaluated both measurements on APEX-SWE Observability, where agents solve software debugging tasks with access to read-only MCP tools. The default harness included Grafana/Loki, Mattermost, and Plane. The ablated harness kept only Grafana/Loki.

The result was the pattern the metric was designed to reveal. Removing noisy tools preserved task accuracy while increasing tool efficiency.

For GPT-5.3-Codex, mean tool efficiency moved from 0.359 with the default suite to 0.720 with Grafana/Loki only. For Gemini 3.1 Pro, it moved from 0.367 to 0.593.

Definitions

Marginal Tool Utility

Does a specific tool call increase the agent’s probability of solving the task? Positive, negative, or zero per call.
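
One way to write this down is sketched below. It is an illustrative formalization, not necessarily the paper's exact definition. The symbols are assumptions introduced here: h_{t-1} is the trajectory before the t-th tool call, a_t is the call, o_t is the tool's response, and P(solve | ·) is the estimated probability that the agent goes on to solve the task.

```latex
\mathrm{MTU}_t \;=\; P(\text{solve} \mid h_{t-1}, a_t, o_t) \;-\; P(\text{solve} \mid h_{t-1}),
\qquad
\operatorname{sign}(\mathrm{MTU}_t) \in \{+,\, -,\, 0\}
```

Under this reading, a positive value means the call raised the agent's chance of success, a negative value means it lowered it, and zero means the call made no difference.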

Tool Efficiency

The share of tool calls in a trajectory that carried positive marginal utility. A number worth optimizing.
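
As a minimal sketch of this definition, assume every tool call in a trajectory already has an MTU estimate; how that estimate is obtained is outside this example. The function name and the list-of-floats input are illustrative assumptions, not the paper's code.

```python
from typing import Sequence


def tool_efficiency(mtu_values: Sequence[float]) -> float:
    """Share of tool calls in one trajectory with positive marginal utility.

    `mtu_values` holds one (hypothetical) MTU estimate per tool call.
    """
    if not mtu_values:
        # Efficiency is undefined for a trajectory without tool calls.
        raise ValueError("trajectory has no tool calls")
    useful = sum(1 for value in mtu_values if value > 0)
    return useful / len(mtu_values)
```

For example, a trajectory with MTU values `[0.2, -0.1, 0.0, 0.3]` has two useful calls out of four, so its tool efficiency is 0.5.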

Tool efficiency before and after removing noisy tools

GPT-5.3-Codex

Default suite (Grafana + Mattermost + Plane): 0.359
Grafana / Loki only: 0.720
Change: +100% efficiency, accuracy preserved

Gemini 3.1 Pro

Default suite (Grafana + Mattermost + Plane): 0.367
Grafana / Loki only: 0.593
Change: +62% efficiency, accuracy preserved
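
The percentage changes are relative gains over the default suite. The sketch below recomputes them from the rounded means reported on this page. The helper name is an assumption, and results from rounded inputs may differ slightly from figures computed on unrounded data.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative gain is undefined when the baseline is zero")
    return (after - before) / before


# Rounded mean tool efficiencies as reported above.
gpt_gain = relative_gain(0.359, 0.720)     # roughly 1.006, about a doubling
gemini_gain = relative_gain(0.367, 0.593)  # roughly 0.616

print(f"GPT-5.3-Codex: {gpt_gain:+.1%}")
print(f"Gemini 3.1 Pro: {gemini_gain:+.1%}")
```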

MTU signs by tool: GPT-5.3-Codex, default APEX-SWE trajectories

Tool: positive / negative / zero
Grafana / Loki: 62% / 18% / 20%
Mattermost: 22% / 48% / 30%
Plane: 18% / 55% / 27%

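Mattermost and Plane calls carry negative MTU far more often than positive MTU (48% and 55% negative, against 22% and 18% positive). A call to either tool is more likely to lower the agent's chance of solving the task than to raise it.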
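
A per-tool breakdown like this can be tallied from call-level MTU estimates. The sketch below assumes each call is recorded as a `(tool_name, mtu)` pair. That record format, the function names, and the sample values are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

LABELS = ("positive", "negative", "zero")


def sign_label(mtu: float) -> str:
    """Classify one MTU estimate by its sign."""
    if mtu > 0:
        return "positive"
    if mtu < 0:
        return "negative"
    return "zero"


def sign_shares_by_tool(
    calls: Iterable[Tuple[str, float]],
) -> Dict[str, Dict[str, float]]:
    """For each tool, return the share of its calls with each MTU sign."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tool, mtu in calls:
        counts[tool][sign_label(mtu)] += 1
    shares = {}
    for tool, counter in counts.items():
        total = sum(counter.values())
        shares[tool] = {label: counter[label] / total for label in LABELS}
    return shares


# Hypothetical call records, for illustration only.
sample_calls = [
    ("grafana_loki", 0.2), ("grafana_loki", -0.1), ("grafana_loki", 0.1),
    ("plane", -0.3), ("plane", 0.0), ("plane", -0.2),
]
shares = sign_shares_by_tool(sample_calls)
```

Tools whose negative share exceeds their positive share are natural candidates for an ablation like the Grafana/Loki-only harness.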

The lesson is not that agents should always receive fewer tools. The lesson is that tool suites deserve measurement. Some tools add signal. Some add noise. Marginal tool utility gives us a way to separate the two, and tool efficiency gives us a number worth optimizing.

Foam Research