You won't know which agent works best for you unless you benchmark your codebase.
Will Palmer
15th January 2026
The Origin Story of Sigmabench

The year autocomplete died: 2025 was the year software development moved from autocomplete to agents.

The interaction changed fundamentally. Instead of asking for help writing a function, developers started handing over entire tasks. Agents were expected to reason across a codebase, make decisions, and open a pull request.

The shift was sudden and highly visible. Every week brought new demos. Agents shipping features end to end. Solo founders building products in a weekend. Greenfield repositories moving at breathtaking speed. For leaders inside large engineering organizations, it quickly felt like the industry had moved on without them.

Then the pressure arrived.

Boards, investors, and CEOs all saw the same promise:

  • higher developer throughput
  • lower cost per feature
  • competitors moving faster by building with AI natively

Across many companies, the conclusion was the same. AI adoption in development teams had to increase.

The Adoption Problem Nobody Wanted to Say Out Loud

Inside enterprises, the reality was more complicated.

AI adoption wasn’t failing due to a lack of interest or ambition. It struggled because the tools were not reliably improving productivity on real codebases.

What we heard repeatedly from large organizations followed a familiar pattern.

  • Agents worked well on greenfield projects.
  • They were useful for exploratory work or “vibe coding.”
  • But established codebases felt different. Too large. Too old. Too much legacy. Too much institutional knowledge.

That skepticism came from experience. Instead of dismissing it, we treated it as a signal.

The real question wasn’t why enterprises were slow to adopt. It was what would actually need to change for agentic coding to work inside large, long-lived codebases.

That tension surfaced publicly in July 2025.

A study from METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” reported a surprising result. When experienced developers were allowed to use AI tools, they took 19% longer to complete issues.

The gap between expectation and outcome was striking. Developers expected AI to speed them up by 24%. Even after completing the work, they still believed AI had sped them up by 20%. Measured outcomes showed the opposite.

The implication was clear. Belief, demos, and narrative had moved faster than measurement.

This matched what we were seeing firsthand. Agents were impressive on isolated tasks and new code. In complex, long-lived enterprise systems, results varied widely. Context was fragmented. Assumptions broke down. Small misunderstandings compounded. Engineers spent time steering and correcting output rather than shipping faster.

At that point, the industry had already answered whether agents could write code.

The harder question was how to make them effective inside complex codebases.

Starting With a Constraint

We had spent years building enterprise software. We understood legacy systems, institutional knowledge, and the operating realities of large development teams.

That experience shaped our approach. The skepticism around agentic coding wasn’t an obstacle to overcome. It was a design constraint.

Our goal was straightforward: make agentic coding meaningfully better in large, complex codebases.

We set one rule from the beginning.

Anything we built had to demonstrably improve agent performance.

  • Not through anecdotes.
  • Not through demos.
  • Through measurement that was repeatable and comparable.

If we couldn’t tell whether we were helping or hurting, and by how much, we couldn’t iterate. And we couldn’t earn the trust of the teams who needed this to work.

That requirement quietly shaped everything that followed.

The Benchmarking Gap

At the time, most benchmarks focused on models in isolation.

They were useful for tracking raw capability, but they didn’t answer the question we were grappling with: whether specific interventions actually improved agent performance on real codebases.

In practice, an agent is more than a model.

It is the harness and the model working together.

What we needed was a way to test changes end to end. Real repositories. Real pull requests. Real failure modes. And the ability to rerun those tests every time something changed.

That required a measuring instrument, not a leaderboard.

Sigmabench began as that instrument. At the time, it was simply a means to validate our own work.
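
For concreteness, here is a minimal sketch of the kind of loop that instrument runs. The names below (PullRequestTask, run_benchmark, and the injected callables) are illustrative placeholders rather than Sigmabench’s actual API. The point is the shape: every attempt replays a real pull request task from a fresh checkout, lets the agent work end to end, scores the result against reference tests, and repeats so runs are comparable.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PullRequestTask:
    """One real pull-request task replayed against an agent (illustrative shape)."""
    task_id: str
    repo_url: str               # repository the agent works inside
    base_commit: str            # codebase state before the original change
    description: str            # the issue the agent is asked to resolve
    reference_tests: List[str]  # tests the accepted change had to pass

@dataclass
class AttemptResult:
    task_id: str
    passed: bool                # did the agent's change pass the reference tests?
    seconds: float              # wall-clock time for the attempt

def run_benchmark(
    tasks: List[PullRequestTask],
    prepare_workspace: Callable[[PullRequestTask], str],    # e.g. clone + checkout base_commit
    run_agent: Callable[[str, str], float],                 # (workspace, description) -> elapsed seconds
    run_reference_tests: Callable[[str, List[str]], bool],  # (workspace, tests) -> all passed?
    repeats: int = 3,
) -> List[AttemptResult]:
    """Replay every task several times so results are repeatable and comparable across agents."""
    results: List[AttemptResult] = []
    for task in tasks:
        for _ in range(repeats):
            workspace = prepare_workspace(task)               # fresh, deterministic starting point each run
            seconds = run_agent(workspace, task.description)  # the agent edits the workspace end to end
            passed = run_reference_tests(workspace, task.reference_tests)
            results.append(AttemptResult(task.task_id, passed, seconds))
    return results
```

Running each task more than once is what makes consistency measurable at all; a single attempt cannot distinguish a reliable agent from a lucky one.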

The Context Assumption

The prevailing assumption across the industry was simple. Agents needed more context.

So teams shipped context. Markdown files. AI rules. Internal documentation exposed through MCP servers. We did the same.

Once we started measuring outcomes end to end, a consistent pattern emerged.

Additional context often degraded performance.

Agents slowed down. They latched onto irrelevant details. They overfit to rules. With more surface area, they became confidently wrong more often.

The conclusion became difficult to ignore.

Context only helps when it is high signal for the specific task the agent is performing. Otherwise, it acts as noise.

What Finally Worked: CodeNav

After many failed attempts, we changed direction.

Instead of expanding the main context window, we built CodeNav, a tool designed to help agents navigate large, complex codebases. CodeNav generated structured, high-signal summaries that were maintained independently of the repository itself.

It focused on the questions agents repeatedly struggled with:

  • what the system is for
  • how it behaves
  • its constraints and invariants
  • how components relate

Agents used CodeNav through a sub-agent workflow before beginning work.
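
To make that concrete, here is a rough sketch of what such a summary could look like. The fields and the relevance filter below are assumptions for illustration, not CodeNav’s actual schema or workflow; they simply mirror the four questions above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComponentSummary:
    """High-signal summary of one component (illustrative fields, not CodeNav's real schema)."""
    name: str
    purpose: str                                          # what the component is for
    behavior: str                                         # how it behaves at runtime
    invariants: List[str] = field(default_factory=list)   # constraints the agent must not break
    related_to: List[str] = field(default_factory=list)   # how it connects to other components

@dataclass
class CodebaseSummary:
    """Generated ahead of time and consulted before the agent starts the task."""
    system_purpose: str                                   # what the system as a whole is for
    components: List[ComponentSummary] = field(default_factory=list)

    def relevant_to(self, task_keywords: List[str]) -> "CodebaseSummary":
        """Keep only components that look relevant to the task, so the context stays high signal."""
        keep = [
            c for c in self.components
            if any(k.lower() in (c.name + " " + c.purpose).lower() for k in task_keywords)
        ]
        return CodebaseSummary(self.system_purpose, keep)
```

In a workflow like the one described above, a sub-agent would read the filtered summary first, then hand its understanding to the main agent before any code is edited.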

The results were dramatic.

Before Claude Code introduced Explore and Plan, CodeNav produced a 55% improvement in accuracy on complex, multi-file pull requests, along with a sharp reduction in variability.

When Sonnet 4.5 shipped, we reran the benchmark. Accuracy still improved by 33% relative to baseline.

The direction was clear. Structured understanding mattered far more than raw context volume.

Then the ground shifted.

When the Ground Moved Underneath Us

Claude Code released Explore and Plan.

The improvement itself wasn’t surprising. The mechanism behind it was.

Explore and Plan delivered similar gains without heavy repository preprocessing or persistent summaries. Sub-agents navigated code dynamically, guided by reasoning rather than stored representations.

After that release, the remaining performance gap between CodeNav and native Claude Code narrowed to roughly the size of our margin of error. Still measurable, but no longer decisive.

That narrow gap carried an important signal.

When a standalone product only marginally outperforms a default experience from a foundation model provider, sustaining it as an independent business becomes extremely difficult.

This moment also clarified a broader insight.

Large performance gains often come from harness and workflow improvements, not just model changes. Without careful measurement, those gains are easy to misattribute.

So we paused. And we reassessed.

The Insight That Changed Everything

Stepping back, one capability stood out clearly.

In trying to make agentic coding work inside complex codebases, we had built a codebase-agnostic evaluation framework for coding agents.

It worked across repositories.

It produced deterministic, repeatable measurements.

And it answered a question few teams could answer objectively.

How well does this agent perform on our codebase?

As we ran more benchmarks, a deeper pattern emerged.

Agent performance varied by 30 to 60% across superficially similar codebases. Same language. Similar size. Very different outcomes.

There was no universal ordering. Only context-specific results.

Why Sigmabench Exists

Sigmabench exists to bring measurement back into the conversation.

We benchmark coding agents on real pull request tasks and report outcomes that matter in day-to-day development:

  • Accuracy
  • Consistency
  • Speed
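
As a rough illustration, here is one way outcomes like the three above can be computed from repeated attempts, such as those produced by the earlier benchmark sketch. The definitions are illustrative, not necessarily the ones Sigmabench reports: accuracy as the overall pass rate, consistency as the run-to-run spread within each task, and speed as mean wall-clock time per attempt.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List

def summarize(attempts: List[dict]) -> Dict[str, float]:
    """Each attempt is {"task_id": str, "passed": bool, "seconds": float} (illustrative definitions)."""
    by_task: Dict[str, List[dict]] = defaultdict(list)
    for a in attempts:
        by_task[a["task_id"]].append(a)

    # Consistency: how much repeated runs of the *same* task disagree with each other.
    # A spread of 0.0 means the agent either always solves a task or never does.
    per_task_spread = [
        pstdev([float(a["passed"]) for a in runs]) if len(runs) > 1 else 0.0
        for runs in by_task.values()
    ]

    return {
        "accuracy": mean(float(a["passed"]) for a in attempts),  # share of attempts that pass
        "consistency_spread": mean(per_task_spread),             # lower means more predictable behavior
        "mean_seconds": mean(a["seconds"] for a in attempts),    # average wall-clock time per attempt
    }
```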

We then make that same benchmark available to development teams so they can evaluate new agents and adjacent tools on their own codebase.

With new agent and model combinations shipping every few days, tens of billions invested, and rapid progress across the ecosystem, intuition alone is no longer sufficient.

Sigmabench answers a practical question:

What is the best agent for our codebase, and where will we actually see the gains?

That is the origin story.

Not a theory.

A failed attempt to fix agentic coding that became a measurement problem, and then a product.