Root cause accuracy from observability data

See the benchmark for the code and data behind this result.

We published a benchmark for one question we believe every debugging agent should answer cleanly: what caused the production failure?

The benchmark evaluates root cause analysis from telemetry. It is not a log summarization task, and it is not a test of whether an agent can name a nearby service. The expected answer is the kind of answer an engineer can act on: the failing behavior, the relevant signal, and the fault that explains both.
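One way to picture that expected answer is as a record with three required parts. The sketch below is illustrative only: the `RcaAnswer` type, its field names, and the sample incident are hypothetical and are not taken from the benchmark's actual schema or cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RcaAnswer:
    """Hypothetical shape of an actionable root cause answer."""

    failing_behavior: str  # what went wrong, as users or callers saw it
    relevant_signal: str   # the telemetry that shows the failure
    fault: str             # the underlying cause that explains both

    def is_actionable(self) -> bool:
        # An answer that leaves any of the three parts blank is incomplete.
        parts = (self.failing_behavior, self.relevant_signal, self.fault)
        return all(part.strip() for part in parts)


# Made-up incident, for illustration only.
example = RcaAnswer(
    failing_behavior="Checkout requests intermittently return HTTP 500",
    relevant_signal="Error spans on the payment client show connection timeouts",
    fault="Connection pool size was lowered in the latest deploy",
)

# Naming a nearby service without a fault does not count as a full answer.
incomplete = RcaAnswer(
    failing_behavior="Checkout requests intermittently return HTTP 500",
    relevant_signal="Elevated error rate on the payment service",
    fault="",
)
```

Under this framing, `example.is_actionable()` holds while `incomplete.is_actionable()` does not, which mirrors the distinction between explaining a fault and pointing at a service.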

On this benchmark, Foam reaches 86% RCA accuracy. Cursor with Sentry reaches 41%. Cursor with Foam reaches 64%.

The difference is not only model quality. It is the shape of the context. Agents are materially better when telemetry is presented as an investigation surface instead of as raw exhaust.

Figure: RCA accuracy on the public Foam benchmark (github.com/foam-ai/benchmarks · 32 cases · 8 technical categories)
Cursor + Sentry (frontier model, no OTel access): 41%
Cursor + Foam (frontier model, with OTel access): 64%
Foam (Sonnet 4.6; smaller model, full system design): 86%

Figure: The progression, and what each step adds
Cursor + Sentry (frontier, no OTel): 41%
+23pp from OTel access
Cursor + Foam MCP (frontier, with OTel): 64%
+22pp from system design
Foam (Sonnet 4.6; smaller model, full system): 86%
Moving from no OTel access to OTel access on the same frontier model adds 23 percentage points. System design, on a cheaper model, then adds another 22pp. The leverage is the shape of the context, not the model.
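The step sizes follow directly from the three reported accuracies. A minimal sketch of that arithmetic, using the figures quoted in this post (the `reported` list and the `pp_gains` helper are names chosen for illustration):

```python
# Reported RCA accuracies in percent, as stated in this post.
reported = [
    ("Cursor + Sentry", 41),
    ("Cursor + Foam", 64),
    ("Foam (Sonnet 4.6)", 86),
]


def pp_gains(steps):
    """Return the percentage-point gain between consecutive configurations."""
    return [
        (prev_name, name, acc - prev_acc)
        for (prev_name, prev_acc), (name, acc) in zip(steps, steps[1:])
    ]


gains = pp_gains(reported)
for prev_name, name, delta in gains:
    print(f"{prev_name} -> {name}: +{delta}pp")

# Total distance between the baseline and the full system.
total_gain = sum(delta for _, _, delta in gains)
```

These are percentage-point differences between absolute accuracies, not relative improvements over the baseline.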

We use this benchmark as a product instrument. If Foam improves here, it means the system is getting better at the work customers actually ask it to do: isolate the fault, explain the evidence, and reduce the time between incident and fix.

Foam Research