Methodology Versioning
This page documents the Sigmabench methodology v1, frozen as of December 2025.
Future releases of Sigmabench will publish updated methodology documents as v2, v3, etc. For now, there is only v1.
Purpose and Scope
Sigmabench is designed to evaluate agentic coding systems—the combination of a coding agent’s execution harness and its underlying language model—rather than language models in isolation. Modern coding agents incorporate substantial logic beyond text generation, including planning, file manipulation, multi-step execution loops, sub-agents, context-window management, and the selective use of external tools. These behaviors materially affect real-world outcomes and cannot be inferred from model-level benchmarks alone. Accordingly, Sigmabench measures the end-to-end performance of each agent + model pairing exactly as it is delivered to developers.
The benchmark focuses on metrics chosen for their practical relevance in real software engineering workflows. In addition to measuring accuracy (how often an agent produces an acceptable patch), we also quantify consistency, defined here as the degree to which an agent remains useful when it fails. Even when an agent does not reach an acceptable solution, it may still produce partially correct work that reduces developer effort, and this distinction is important for understanding real-world utility. We also measure speed, reflecting the time required for an agentic system to converge on a result.
The scope of Sigmabench v1 (December 2025) is therefore to provide a structured, reproducible, and interpretable comparison of leading coding agents in conditions that approximate real developer workflows, while intentionally remaining agnostic about the underlying mechanisms each agent uses to deliver its results.
Dataset Construction
Repository Selection and Classification
The Sigmabench dataset is derived from publicly available open-source repositories. Each repository is classified along
two dimensions: size and primary programming language. Size is measured in lines of code (LOC) using cloc,
with the following categories:
- Small: 10–50k LOC
- Medium: 50–250k LOC
- Large: >250k LOC
The benchmark covers four languages: Python, Java, Go, and JavaScript/TypeScript.
Repositories are selected according to these criteria:
- minimum size of 10k LOC
- must constitute a legitimate software project (not an asset collection or non-code repository)
- at least 80% of source code must be in the primary language
- commit messages must follow the Conventional Commits specification
The final dataset is balanced across languages and size categories, comprising 60 repositories: five repositories for each of the twelve language–size combinations.
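As an illustration of how these criteria could be applied, the following Python sketch classifies a repository using cloc's JSON output. Only the size boundaries, the 10k LOC minimum, and the 80% language-share criterion come from the description above; the function name, return structure, and error handling are assumptions rather than the actual selection tooling.

```python
import json
import subprocess

# Illustrative sketch: classify a repository by size and primary language using
# cloc's JSON output. Apart from the documented size boundaries and the 80%
# language-share criterion, structure and field names are assumptions.
def classify_repository(repo_path: str) -> dict:
    raw = subprocess.run(
        ["cloc", "--json", repo_path],
        check=True, capture_output=True, text=True,
    ).stdout
    report = json.loads(raw)
    report.pop("header", None)
    total = report.pop("SUM")["code"]

    if total < 10_000:
        return {"eligible": False, "reason": "below the 10k LOC minimum"}

    # Size categories: Small 10-50k, Medium 50-250k, Large >250k LOC.
    if total <= 50_000:
        size = "small"
    elif total <= 250_000:
        size = "medium"
    else:
        size = "large"

    # The primary language must account for at least 80% of source code.
    primary, stats = max(report.items(), key=lambda kv: kv[1]["code"])
    share = stats["code"] / total
    return {
        "eligible": share >= 0.80,
        "size": size,
        "primary_language": primary,
        "language_share": round(share, 3),
    }
```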
Commit Selection and Classification
Each repository contributes 15 commits, stratified by the number of added/updated/deleted source files:
- Small: 1–3 files
- Medium: 4–10 files
- Large: 11–25 files
Commits are subject to the following criteria:
- no more than 25 source files
- changes must primarily involve source code rather than non-code assets
- the commit must represent a feature, identified by the `feat` type in the Conventional Commits schema
- the commit must have a single parent (i.e. not a merge commit)
For every repository, five commits are selected for each size category, yielding a uniform distribution across commit sizes for a total of 900 commits.
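The commit filter can be sketched in the same spirit. The snippet below assumes plain git commands suffice to check the criteria; the regular expression for the Conventional Commits `feat` type and the helper names are illustrative, and the "primarily source code" check is omitted for brevity.

```python
import re
import subprocess

def run_git(repo: str, *args: str) -> str:
    # Minimal helper to run a git command inside the repository (illustrative).
    return subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout

def commit_size_category(repo: str, sha: str) -> str | None:
    """Return the commit's size bucket, or None if it fails the selection criteria."""
    # Single parent only: `rev-list --parents` prints the commit followed by its parents.
    if len(run_git(repo, "rev-list", "--parents", "-n", "1", sha).split()) != 2:
        return None

    # Conventional Commits: the subject must carry the `feat` type, e.g. "feat(scope): ...".
    subject = run_git(repo, "log", "-n", "1", "--format=%s", sha).strip()
    if not re.match(r"feat(\([^)]*\))?!?:", subject):
        return None

    # Count added/updated/deleted files in the commit.
    changed = [line for line in
               run_git(repo, "diff", "--name-status", f"{sha}^!").splitlines()
               if line.strip()]
    n_files = len(changed)

    if 1 <= n_files <= 3:
        return "small"
    if 4 <= n_files <= 10:
        return "medium"
    if 11 <= n_files <= 25:
        return "large"
    return None   # empty commit or more than 25 files
```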
Golden Diff Generation
For each selected commit, we compute a golden diff: the reference change set that agents are evaluated against. The diff is extracted using `git diff --name-status <commit-hash>^!`. This representation lists each changed file together with the type of operation (A, M, D, R) and serves as the canonical target for scoring. The golden diff is not shown to agents during evaluation; it is used only as the reference output when computing similarity scores between the agent-produced change set and the golden change set for the commit.
Here is an example golden diff for a large commit on Karpenter:
M cmd/controller/main.go
M pkg/apis/apis.go
M pkg/cloudprovider/amifamily/ami.go
M pkg/cloudprovider/amifamily/resolver.go
M pkg/cloudprovider/cloudprovider.go
M pkg/cloudprovider/instance.go
M pkg/cloudprovider/launchtemplate.go
M pkg/cloudprovider/suite_test.go
M pkg/controllers/controllers.go
A pkg/controllers/drift/controller.go
A pkg/controllers/drift/suite_test.go
M pkg/controllers/interruption/controller.go
A pkg/fake/cloudprovider.go
M pkg/fake/ec2api.go
M pkg/fake/ssmapi.go
M pkg/utils/utils.go
A test/suites/drift/suite_test.go
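A minimal sketch of the extraction step, using the command shown above and Python's subprocess module; the function name and the normalization of rename entries (e.g. `R100` to `R`) are illustrative choices, not a description of the actual harness.

```python
import subprocess

def golden_diff(repo: str, sha: str) -> set[tuple[str, str]]:
    """Extract the golden diff for a commit as a set of (operation, path) entries."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-status", f"{sha}^!"],
        check=True, capture_output=True, text=True,
    ).stdout

    entries = set()
    for line in out.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status, path = fields[0], fields[-1]   # rename lines carry old and new paths
        op = status[0]                         # "R100" -> "R", "M" -> "M", ...
        entries.add((op, path))
    return entries
```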
Prompt Generation
Each commit is converted into a task prompt that describes the functional and technical requirements implied by the commit. Prompts are generated automatically using a single, fixed coding agent/model combination to ensure consistency.
The agent receives the following instruction template:
Run `git show HEAD`, look at the changes introduced and describe them in technical and
functional terms in about ~250 words. Describe the changes in terms of functional (HOW it
should work) and technical requirements (WHAT should be added/modified and HOW) but
WITHOUT referencing file names, directory names, class names, or any other symbol name in
the code. Do not describe the changes as if they had been made already but rather present
the requirements as if they had not been implemented yet and must now be implemented. Use
the imperative form and structure your output as an itemized list of requirements. Do not
include an introductory sentence. Start directly with the requirements list.
Here is an example task prompt for the same commit as in the previous section:
- Introduce a capability to continuously monitor existing compute resources and identify
those whose underlying configuration, such as their base operating system image, has
diverged from the desired specification.
- Implement a control mechanism that allows operators to enable or disable this
configuration divergence detection feature. When enabled, resources found to be out of
sync should be clearly flagged.
- Develop a component that can parse and extract unique identifiers for individual compute
instances from their platform-specific resource identifiers, ensuring robust handling of
various identifier formats and potential parsing errors.
- Integrate a simulated external infrastructure interface for testing purposes. This
interface must allow for defining specific responses regarding the state and
configuration of compute resources, enabling comprehensive validation of the divergence
detection logic in isolation.
- Extend the test suite with new scenarios to verify the correctness of the configuration
divergence detection, including cases where resources are correctly aligned, cases where
they have diverged due to an outdated or invalid base operating system image, and
scenarios where the detection feature is explicitly disabled.
Metrics
Similarity with the Golden Diff
Each task instance compares two change sets:
- Golden: the ground-truth diff, computed as `git diff --name-status <commit>^!`
- Actual: the diff produced by the agent's changes
Each diff is parsed into a set of entries of the form:
- a (file path, operation) pair, where the operation is one of A, M, or D (add, modify, delete; renames are ignored)
Let $G$ and $A$ denote the sets of such pairs for the golden and actual diffs.
We define:
- $TP = G \cap A$ (exact matches between predicted and golden entries)
- $FP = A \setminus G$ (entries introduced by the agent but absent from the golden diff)
- $FN = G \setminus A$ (golden entries the agent failed to produce)
To obtain a normalised similarity score $S$ in $[0, 1]$, we use the Jaccard Index:

$$S = \frac{|TP|}{|TP| + |FP| + |FN|} = \frac{|G \cap A|}{|G \cup A|}$$
This score reflects agreement at the level of discrete file-operations and treats unnecessary edits and missing edits symmetrically.
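A minimal sketch of this computation, assuming both change sets have already been parsed into sets of (operation, path) entries (for example by a parser like the one sketched earlier); the handling of the empty-diff edge case is an assumption.

```python
def similarity(golden: set[tuple[str, str]], actual: set[tuple[str, str]]) -> float:
    """Jaccard index over (operation, path) entries; rename entries are dropped."""
    golden = {e for e in golden if e[0] != "R"}
    actual = {e for e in actual if e[0] != "R"}
    union = golden | actual
    if not union:
        return 1.0   # assumption: two empty change sets count as identical
    return len(golden & actual) / len(union)
```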
Acceptable Patch Rate (APR)
The Acceptable Patch Rate is a measure of a coding agent's accuracy. A task instance is counted as acceptable when its similarity score $S$ meets or exceeds the acceptance threshold. The APR is then the proportion of all task instances that meet this criterion.
Partial Patch Rate (PPR)
The Partial Patch Rate is a measure of a coding agent's consistency. A task instance is counted as partially acceptable when its similarity score $S$ meets or exceeds a lower, partial-acceptance threshold. The PPR is then the proportion of task instances that are not acceptable under the APR criterion but do meet the partially acceptable one.
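Because the numeric thresholds are not reproduced here, the sketch below takes them as parameters; `accept_threshold` and `partial_threshold` are hypothetical names, with the partial threshold assumed to be the lower of the two.

```python
def patch_rates(scores: list[float], accept_threshold: float, partial_threshold: float):
    """Compute (APR, PPR) from per-task similarity scores.

    The threshold values are parameters here because they are not reproduced
    in this document; partial_threshold < accept_threshold is assumed."""
    n = len(scores)
    acceptable = sum(1 for s in scores if s >= accept_threshold)
    partial = sum(1 for s in scores if partial_threshold <= s < accept_threshold)
    return acceptable / n, partial / n
```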
Time Utilization Score (TUS)
The Time Utilization Score is a measure of a coding agent's speed. Each task instance with a measured runtime of $t$ seconds is scored as:

$$\mathrm{TUS} = 1 - \frac{\log t}{\log T}$$

The value $T$ is the timeout, fixed at 1,200 seconds for all task instances.
- Runs that hit the timeout ($t = T$) receive $\mathrm{TUS} = 0$.
- Faster runs receive higher scores, approaching 1 as $t$ decreases.
- The logarithms capture the multiplicative nature of latency differences: for example, reducing $t$ by 50% always increases $\mathrm{TUS}$ by the same amount, $\log 2 / \log 1200 \approx 0.098$ (about 10 percentage points).
For a set of tasks, the aggregate TUS is defined as the median of the per-instance scores, which reduces sensitivity to outliers and occasional timeouts.
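A per-instance score consistent with the definition above can be sketched as follows; the clamping of very short runtimes is an assumption made to keep the score within $[0, 1]$.

```python
import math
from statistics import median

TIMEOUT_SECONDS = 1200.0   # T, as documented

def tus(runtime_seconds: float, timeout: float = TIMEOUT_SECONDS) -> float:
    """Per-instance Time Utilization Score: 0 at the timeout, higher when faster."""
    t = min(max(runtime_seconds, 1.0), timeout)   # clamp (assumption) so the score stays in [0, 1]
    return 1.0 - math.log(t) / math.log(timeout)

def aggregate_tus(runtimes: list[float]) -> float:
    """Aggregate TUS is the median of the per-instance scores."""
    return median(tus(t) for t in runtimes)
```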
Sigmabench Score (SBS)
The Sigmabench Score (often shortened to Sigmascore) is a composite measure of a coding agent's overall performance. It is defined as the geometric mean of an agent's accuracy, consistency and speed scores:

$$\mathrm{SBS} = \sqrt[3]{\mathrm{APR} \times \mathrm{PPR} \times \mathrm{TUS}}$$
The geometric mean reflects that the three metrics are orthogonal and each is independently necessary: a deficiency in any one dimension materially limits real-world usefulness. In particular:
- low APR: the agent does not produce sufficiently accurate patches
- low PPR: the agent is inconsistent, often producing totally unusable results
- low TUS: the agent disrupts developer workflow by being too slow or timing out
Because the geometric mean penalizes imbalance, an agent must perform reasonably well across all three areas to achieve a high overall score.
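As a sketch, the composite reduces to the cube root of the product of the three components; the numbers in the usage comment are made up for illustration only.

```python
def sigmabench_score(apr: float, ppr: float, tus: float) -> float:
    """Sigmabench Score: geometric mean of accuracy (APR), consistency (PPR), and speed (TUS)."""
    return (apr * ppr * tus) ** (1.0 / 3.0)

# Example (hypothetical values): sigmabench_score(0.6, 0.2, 0.5) ~= 0.39;
# the low PPR drags the composite down despite decent APR and TUS.
```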
Benchmark Procedure
Each benchmark run evaluates an agent/model combination on a curated set of tasks derived from the Sigmabench dataset. A task consists of a repository materialized in a synthetic, single-commit git history: we check out the parent commit of the selected change, then construct a synthetic git repository containing only that parent commit. This prevents contamination from the full project history and ensures that agents cannot exploit information outside of the intended context. The agent receives this synthetic repo along with the task prompt describing the required modification.
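One way the synthetic single-commit history could be built is sketched below: clone the source repository, check out the parent of the selected commit, drop the history, and re-initialize a fresh repository with a single commit. The clone-then-reinit approach, the committer identity, and the function name are assumptions rather than a description of the actual harness.

```python
import shutil
import subprocess
import tempfile

def materialize_task_repo(source_repo: str, commit_sha: str, dest: str) -> None:
    """Build a synthetic repo whose only commit is the parent of the selected change.

    Illustrative sketch; the real harness may differ in details such as
    identity configuration, ignored files, or submodule handling."""
    with tempfile.TemporaryDirectory() as tmp:
        # Export the parent commit's tree without any git history.
        subprocess.run(["git", "clone", "--quiet", source_repo, tmp], check=True)
        subprocess.run(
            ["git", "-C", tmp, "checkout", "--quiet", f"{commit_sha}^"],
            check=True,
        )
        shutil.rmtree(f"{tmp}/.git")   # drop the full project history
        shutil.copytree(tmp, dest)

    # Re-initialize a fresh history containing only this snapshot.
    subprocess.run(["git", "-C", dest, "init", "--quiet"], check=True)
    subprocess.run(["git", "-C", dest, "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", dest, "-c", "user.name=sigmabench",
         "-c", "user.email=bench@example.invalid",
         "commit", "--quiet", "-m", "Initial snapshot (parent commit)"],
        check=True,
    )
```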
Execution Environment and Version Freezing
All agents run inside an isolated container with a standardized filesystem layout and a minimal, language-appropriate toolchain. To ensure fair comparison, every benchmark run freezes the versions of all agent harnesses, CLIs, and auxiliary tooling at a specific date (the benchmark “freeze date”). This creates a stable and fair execution baseline, eliminating variability introduced by upstream agent or tool updates.
The environment allows agents to read and modify files but does not guarantee parity with each project’s intended development setup. In order to prevent agents from wasting time installing missing software, the following prompt preamble is prepended to the task prompt:
<execution-environment>
You are running inside a locked-down Docker container with a fixed set of development
tools. A limited selection of compilers, linters, and language runtimes is available, but
some tools may be missing. Use only the tools that are present. Do not attempt to install,
download, or fetch additional software or dependencies; such operations will fail.
Complete the task using only the capabilities available in this environment.
</execution-environment>
Agent Interaction Model and Reasoning Effort
Agents operate in iterative cycles in which they may inspect files, run tools, access the internet, and modify files. Each task provides the agent with a fixed time and compute budget.
Sigmabench does not override or normalize model-level reasoning modes. Instead, the benchmark uses the default reasoning effort specified by each agent harness (e.g., medium for Codex CLI with GPT‑5.1-Codex-Max). These defaults are frozen and reported so that users can interpret results in the context of the agent’s native configuration.
The agent's final output is the diff it induces via file modifications; it is captured with `git diff --name-status` at task completion or timeout.
Scoring and Outcome Recording
For each task, the benchmark records:
- the diff produced by the agent,
- timing information,
- an exit condition (success, error, or timeout),
- diagnostic logs relevant to agent behavior.
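A per-task record along these lines might look as follows; the field names and types are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    """Per-task record captured at the end of a run (illustrative field names)."""
    task_id: str
    agent_diff: set[tuple[str, str]]   # parsed (operation, path) entries from `git diff --name-status`
    runtime_seconds: float
    exit_condition: str                # "success", "error", or "timeout"
    logs: list[str] = field(default_factory=list)
```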
Statistical Methods
Sigmabench quantifies uncertainty in all metrics using nonparametric bootstrap resampling. For each agent/model combination, we draw 5,000 bootstrap samples, each consisting of a resampling (with replacement) of the full set of tasks. For each resample we recompute the metric, producing a distribution of 5,000 bootstrap estimates. The 95% confidence interval is taken as the 2.5th and 97.5th percentiles of this empirical distribution. This approach makes no assumptions about underlying metric distributions.
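A percentile-bootstrap sketch along these lines is shown below; the metric is passed as a callable, and the seed, names, and return structure are illustrative.

```python
import numpy as np

def bootstrap_ci(values, metric, n_resamples: int = 5000, seed: int = 0):
    """Percentile bootstrap 95% CI for a metric computed over per-task values.

    `metric` maps a 1-D array of per-task values (e.g. similarity scores or
    per-instance TUS values) to a scalar; the fixed seed is illustrative."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        resample = rng.choice(values, size=len(values), replace=True)
        estimates[i] = metric(resample)
    lower, upper = np.percentile(estimates, [2.5, 97.5])
    return estimates.mean(), (lower, upper)   # bootstrap mean (midpoint) and 95% CI
```

For example, passing per-instance TUS values together with `np.median` as the metric would yield a midpoint and interval for the aggregate TUS.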
Agent ranking requires an additional rule to balance statistical caution with practical discriminability. Sigmabench assigns ranks by comparing the midpoint (bootstrap mean) of one agent's metric with the upper bound of the 95% confidence interval of another. Agent A is considered strictly better than Agent B if A's midpoint exceeds B's upper bound. If neither agent is strictly better under this rule, the two are assigned the same rank. This relaxed dominance criterion avoids over-interpreting small differences while still surfacing meaningful performance separation where the data supports it.
To improve interpretability, Sigmabench uses both numerical ranks and tiers. Numerical ranks reflect strict ordering under the dominance criterion described above: agents deemed indistinguishable are assigned the same rank, and subsequent ranks are incremented by the size of the tied group (e.g., 1, 1, 3). This preserves the numerical position implied by the comparisons. Tiers, by contrast, group agents by statistical indistinguishability without gaps in the numbering: tied agents share the same tier, and tier numbers increase sequentially regardless of group size (e.g., 1, 1, 2). Ranks emphasize relative position, while tiers emphasize practical equivalence. For these reasons, Sigmabench emphasizes tiers over ranks in most cases.
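The rank and tier assignment can be sketched as follows, given each agent's bootstrap midpoint and CI upper bound. Grouping agents by comparing the current group's leader against each subsequent agent is one possible reading of the tie rule; the function name and data layout are illustrative.

```python
def assign_ranks_and_tiers(results: dict[str, tuple[float, float]]):
    """Assign ranks and tiers from per-agent (midpoint, ci_upper) pairs.

    Agent A strictly beats agent B when A's midpoint exceeds B's 95% CI upper
    bound; agents not separated by this rule share a rank and a tier."""
    # Best midpoint first.
    ordered = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)

    # Group consecutive agents that the group's leader does not strictly beat.
    groups, current = [], [ordered[0]]
    for name, (mid, upper) in ordered[1:]:
        leader_mid = current[0][1][0]
        if leader_mid > upper:                    # strictly better: start a new group
            groups.append(current)
            current = [(name, (mid, upper))]
        else:                                     # indistinguishable: same group
            current.append((name, (mid, upper)))
    groups.append(current)

    ranks, tiers, position = {}, {}, 1
    for tier, group in enumerate(groups, start=1):
        for name, _ in group:
            ranks[name] = position                # e.g. 1, 1, 3 for a two-way tie at the top
            tiers[name] = tier                    # e.g. 1, 1, 2 for the same situation
        position += len(group)
    return ranks, tiers
```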
Limitations and Notes
Sigmabench is designed to provide a consistent and reproducible assessment of agentic coding performance, but several limitations are inherent to our methodology.
LLM-generated task prompts. Task prompts are produced by an LLM via reverse-engineering from the golden diff rather than being human-authored. This introduces a degree of noise, since prompt phrasing and emphasis may differ from how a human would frame the same task. The approach ensures procedural consistency across all commits, but it also means that prompts are more synthetic than those seen in real workflows.
File-level scoring rather than code-content scoring. The benchmark evaluates whether the agent applies the correct operations to the correct set of files rather than whether the resulting code compiles, runs, or matches the exact patch contents. This aligns the metric with structural correctness—i.e., whether the agent understood where the work belongs—rather than line-level accuracy. However, it also means Sigmabench does not detect semantic or syntactic correctness issues within files.
Uniform, non-specialized toolchains. Agents operate inside a standardized container that provides a minimal toolchain covering the major languages, but it is not tailored to the exact build, linting, or testing environment of each individual repository. As a result, some agents may struggle with missing tools that would normally be present in a real project setup. Because this constraint is uniform across all agents, we expect its impact on relative rankings to be limited.
Open-source–only dataset. Sigmabench is constructed exclusively from publicly available open-source repositories. These projects differ in meaningful ways from the broader population of proprietary or commercial codebases: they tend to have clearer modular boundaries, more consistent formatting, stronger linting or CI conventions, and more frequent small-scope commits. Conversely, many industry codebases contain legacy components, inconsistent style, monolithic modules, or domain-specific tooling not well represented in open source.
Unrepresented programming languages. Several widely used programming languages do not appear in Sigmabench v1. This is not due to lack of interest, but rather to the difficulty of finding open-source repositories that satisfy all dataset criteria across the required size categories. In practice, languages such as PHP, the broader .NET ecosystem (C#, F#, VB.NET), and C/C++ rarely met the combination of size, commit-structure, and Conventional Commit usage needed for inclusion. Their absence means that benchmark results may not generalize to codebases primarily written in these languages.
Lack of interactive agent evaluation. Many coding agents support interactive workflows in which users iteratively provide feedback, clarify requirements, or guide the agent through multi-step reasoning. Sigmabench v1 does not evaluate this mode of use. All benchmark runs follow a fixed, non-interactive pattern: a single task prompt, one autonomous execution loop, and a final output. This setup allows for consistent comparison across agents, but it does not capture the potential performance differences that may emerge during interactive, human-in-the-loop sessions.
Restriction to CLI-based agents. Sigmabench v1 restricts evaluation to command-line–based coding agents. This decision was driven primarily by practical considerations: CLI agents allow for fully automated, reproducible execution in a controlled environment, whereas graphical or browser-based agents require additional setup, state management, and interaction scaffolding that were deemed out of scope for the first version of the benchmark. As a result, Sigmabench v1 does not capture performance characteristics specific to GUI-driven workflows. Future versions of the benchmark may expand coverage to include non-CLI agents as the methodology and infrastructure evolve.