Root cause accuracy from observability data

See the benchmark at github.com/foam-ai/benchmarks for the code and data behind this result.

We published a benchmark for one question we believe every debugging agent should answer cleanly: what caused the production failure?

The benchmark evaluates root cause analysis from telemetry. It is not a log summarization task, and it is not a test of whether an agent can name a nearby service. The expected answer is the kind of answer an engineer can act on: the failing behavior, the relevant signal, and the fault that explains both.
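As an illustration only, the three parts of an actionable answer can be pictured as a small record. The class name, field names, and the example incident below are hypothetical and invented for this sketch; they are not the benchmark's schema or one of its cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootCauseAnswer:
    """Hypothetical shape of an actionable RCA answer (not the benchmark's schema)."""

    failing_behavior: str  # what was observed going wrong
    relevant_signal: str   # the telemetry evidence that points at the fault
    fault: str             # the underlying cause that explains both of the above

    def is_actionable(self) -> bool:
        # Treat an answer as actionable only when all three parts are non-empty.
        parts = (self.failing_behavior, self.relevant_signal, self.fault)
        return all(part.strip() for part in parts)


# Invented example incident, for illustration only.
example = RootCauseAnswer(
    failing_behavior="checkout requests return HTTP 500",
    relevant_signal="payment service spans show timeouts when connecting to the database",
    fault="database connection pool exhausted after its size was lowered in a config change",
)

# Naming a nearby service without evidence leaves the signal empty.
nearby_service_only = RootCauseAnswer(
    failing_behavior="checkout requests return HTTP 500",
    relevant_signal="",
    fault="something in the payment service",
)
```

Under this sketch, `example.is_actionable()` is true, while `nearby_service_only.is_actionable()` is false because no supporting signal is given.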

On this benchmark, Foam reaches 86% RCA accuracy. Cursor with Sentry reaches 41%. Cursor with Foam reaches 64%.

The difference is not only model quality. It is the shape of the context. Agents are materially better when telemetry is presented as an investigation surface instead of as raw exhaust.

Foam RCA accuracy: 86%
Cursor + Sentry baseline: 41%
Gained from OTel access alone: +23pp

RCA accuracy on the public Foam benchmark

Cursor + Sentry, frontier model, no OTel access: 41%
Cursor + Foam, frontier model, with OTel access: 64%
Foam (Sonnet 4.6), smaller model, full system design: 86%

github.com/foam-ai/benchmarks · 32 cases · 8 technical categories

The progression: what each step adds

Cursor + Sentry, frontier, no OTel: 41%
OTel access: +23pp
Cursor + Foam MCP, frontier, with OTel: 64%
System design: +22pp
Foam (Sonnet 4.6), smaller model, full system: 86%

Moving from no OTel access to OTel access on the same frontier model adds 23 percentage points. System design, running on a smaller and cheaper model, then adds another 22 percentage points. The shape of the context is the leverage, not the model.
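The arithmetic behind the two steps can be checked directly from the reported accuracies. This is a minimal sketch: the `reported` list and the `step_gains` helper are names introduced here for illustration, and the only data used are the three percentages stated in this post.

```python
# Accuracy figures as reported in this post, in percent.
reported = [
    ("Cursor + Sentry", 41),
    ("Cursor + Foam MCP", 64),
    ("Foam (Sonnet 4.6)", 86),
]


def step_gains(results):
    """Return (from, to, gain in percentage points) for each consecutive pair."""
    return [
        (prev_name, name, score - prev_score)
        for (prev_name, prev_score), (name, score) in zip(results, results[1:])
    ]


gains = step_gains(reported)
for prev_name, name, gain in gains:
    print(f"{prev_name} -> {name}: +{gain}pp")

# Overall gain from the baseline to the full system.
total_gain = reported[-1][1] - reported[0][1]
print(f"total: +{total_gain}pp")
```

The gains are differences in percentage points, not relative improvements: the two steps of 23pp and 22pp sum to the 45pp gap between the 41% baseline and the 86% result.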

We use this benchmark as a product instrument. If Foam improves here, it means the system is getting better at the work customers actually ask it to do: isolate the fault, explain the evidence, and reduce the time between incident and fix.

Foam Research · April 4, 2026