Methodology Versioning
This page documents the Sigmabench methodology v1, frozen as of December 2025.
Future releases of Sigmabench will publish updated methodology documents as v2, v3, etc. For now, there is only v1.
Purpose and Scope
Sigmabench is designed to evaluate agentic coding systems—the combination of a coding agent’s execution harness and its underlying language model—rather than language models in isolation. Modern coding agents incorporate substantial logic beyond text generation, including planning, file manipulation, multi-step execution loops, sub-agents, context-window management, and the selective use of external tools. These behaviors materially affect real-world outcomes and cannot be inferred from model-level benchmarks alone. Accordingly, Sigmabench measures the end-to-end performance of each agent + model pairing exactly as it is delivered to developers.
The benchmark focuses on metrics chosen for their practical relevance in real software engineering workflows. In addition to measuring accuracy (how often an agent produces an acceptable patch), we also quantify consistency, defined here as the degree to which an agent remains useful when it fails. Even when an agent does not reach an acceptable solution, it may still produce partially correct work that reduces developer effort, and this distinction is important for understanding real-world utility. We also measure speed, reflecting the time required for an agentic system to converge on a result.
The scope of Sigmabench v1 (December 2025) is therefore to provide a structured, reproducible, and interpretable comparison of leading coding agents in conditions that approximate real developer workflows, while intentionally remaining agnostic about the underlying mechanisms each agent uses to deliver its results.
Dataset Construction
Repository Selection and Classification
The Sigmabench dataset is derived from publicly available open-source repositories. Each repository is classified along
two dimensions: size and primary programming language. Size is measured in lines of code (LOC) using cloc,
with the following categories:
- Small: 10–50k LOC
- Medium: 50–250k LOC
- Large: >250k LOC
The benchmark covers four languages: Python, Java, Go, and JavaScript/TypeScript.
Repositories are selected according to these criteria:
- minimum size of 10k LOC
- must constitute a legitimate software project (not an asset collection or non-code repository)
- at least 80% of source code must be in the primary language
- commit messages must follow the Conventional Commits specification
The final dataset is balanced across languages and size categories, comprising 60 repositories: five repositories for each of the twelve language–size combinations.
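As an illustration of how these criteria could be applied, the following Python sketch classifies a repository using cloc's JSON output. Only the size boundaries, the 10k LOC minimum, and the 80% language-share criterion come from the description above; the function name, return structure, and error handling are assumptions rather than the actual selection tooling.

```python
import json
import subprocess

# Illustrative sketch: classify a repository by size and primary language using
# cloc's JSON output. Apart from the documented size boundaries and the 80%
# language-share criterion, structure and field names are assumptions.
def classify_repository(repo_path: str) -> dict:
    raw = subprocess.run(
        ["cloc", "--json", repo_path],
        check=True, capture_output=True, text=True,
    ).stdout
    report = json.loads(raw)
    report.pop("header", None)
    total = report.pop("SUM")["code"]

    if total < 10_000:
        return {"eligible": False, "reason": "below the 10k LOC minimum"}

    # Size categories: Small 10-50k, Medium 50-250k, Large >250k LOC.
    if total <= 50_000:
        size = "small"
    elif total <= 250_000:
        size = "medium"
    else:
        size = "large"

    # The primary language must account for at least 80% of source code.
    primary, stats = max(report.items(), key=lambda kv: kv[1]["code"])
    share = stats["code"] / total
    return {
        "eligible": share >= 0.80,
        "size": size,
        "primary_language": primary,
        "language_share": round(share, 3),
    }
```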
Commit Selection and Classification
Each repository contributes 15 commits, stratified by the number of added/updated/deleted source files:
- Small: 1–3 files
- Medium: 4–10 files
- Large: 11–25 files
Commits are subject to the following criteria:
- no more than 25 source files
- changes must primarily involve source code rather than non-code assets
- the commit must represent a feature, identified by the `feat` type in the Conventional Commits schema
- the commit must have a single parent (i.e. not a merge commit)
For every repository, five commits are selected for each size category, yielding a uniform distribution across commit sizes for a total of 900 commits.
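The commit filter can be sketched in the same spirit. The snippet below assumes plain git commands suffice to check the criteria; the regular expression for the Conventional Commits `feat` type and the helper names are illustrative, and the "primarily source code" check is omitted for brevity.

```python
import re
import subprocess

def run_git(repo: str, *args: str) -> str:
    # Minimal helper to run a git command inside the repository (illustrative).
    return subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout

def commit_size_category(repo: str, sha: str) -> str | None:
    """Return the commit's size bucket, or None if it fails the selection criteria."""
    # Single parent only: `rev-list --parents` prints the commit followed by its parents.
    if len(run_git(repo, "rev-list", "--parents", "-n", "1", sha).split()) != 2:
        return None

    # Conventional Commits: the subject must carry the `feat` type, e.g. "feat(scope): ...".
    subject = run_git(repo, "log", "-n", "1", "--format=%s", sha).strip()
    if not re.match(r"feat(\([^)]*\))?!?:", subject):
        return None

    # Count added/updated/deleted files in the commit.
    changed = [line for line in
               run_git(repo, "diff", "--name-status", f"{sha}^!").splitlines()
               if line.strip()]
    n_files = len(changed)

    if 1 <= n_files <= 3:
        return "small"
    if 4 <= n_files <= 10:
        return "medium"
    if 11 <= n_files <= 25:
        return "large"
    return None   # empty commit or more than 25 files
```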
Golden Diff Generation
For each selected commit, we compute a golden diff: the reference change set that agents are evaluated against. The diff is extracted using `git diff --name-status <commit-hash>^!`. This representation lists each changed file together with the type of operation (A, M, D, R) and serves as the canonical target for scoring. The golden diff is not shown to agents during evaluation; it is used only as the reference output when computing similarity scores between the agent-produced change set and the golden change set for the commit.
Here is an example golden diff for a large commit on Karpenter:
M cmd/controller/main.go
M pkg/apis/apis.go
M pkg/cloudprovider/amifamily/ami.go
M pkg/cloudprovider/amifamily/resolver.go
M pkg/cloudprovider/cloudprovider.go
M pkg/cloudprovider/instance.go
M pkg/cloudprovider/launchtemplate.go
M pkg/cloudprovider/suite_test.go
M pkg/controllers/controllers.go
A pkg/controllers/drift/controller.go
A pkg/controllers/drift/suite_test.go
M pkg/controllers/interruption/controller.go
A pkg/fake/cloudprovider.go
M pkg/fake/ec2api.go
M pkg/fake/ssmapi.go
M pkg/utils/utils.go
A test/suites/drift/suite_test.go
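A minimal sketch of the extraction step, using the command shown above and Python's subprocess module; the function name and the normalization of rename entries (e.g. `R100` to `R`) are illustrative choices, not a description of the actual harness.

```python
import subprocess

def golden_diff(repo: str, sha: str) -> set[tuple[str, str]]:
    """Extract the golden diff for a commit as a set of (operation, path) entries."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-status", f"{sha}^!"],
        check=True, capture_output=True, text=True,
    ).stdout

    entries = set()
    for line in out.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status, path = fields[0], fields[-1]   # rename lines carry old and new paths
        op = status[0]                         # "R100" -> "R", "M" -> "M", ...
        entries.add((op, path))
    return entries
```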
Prompt Generation
Each commit is converted into a task prompt that describes the functional and technical requirements implied by the commit. Prompts are generated automatically using a single, fixed coding agent/model combination to ensure consistency.
The agent receives the following instruction template:
Run `git show HEAD`, look at the changes introduced and describe them in technical and
functional terms in about ~250 words. Describe the changes in terms of functional (HOW it
should work) and technical requirements (WHAT should be added/modified and HOW) but
WITHOUT referencing file names, directory names, class names, or any other symbol name in
the code. Do not describe the changes as if they had been made already but rather present
the requirements as if they had not been implemented yet and must now be implemented. Use
the imperative form and structure your output as an itemized list of requirements. Do not
include an introductory sentence. Start directly with the requirements list.
Here is an example task prompt for the same commit as in the previous section:
- Introduce a capability to continuously monitor existing compute resources and identify
those whose underlying configuration, such as their base operating system image, has
diverged from the desired specification.
- Implement a control mechanism that allows operators to enable or disable this
configuration divergence detection feature. When enabled, resources found to be out of
sync should be clearly flagged.
- Develop a component that can parse and extract unique identifiers for individual compute
instances from their platform-specific resource identifiers, ensuring robust handling of
various identifier formats and potential parsing errors.
- Integrate a simulated external infrastructure interface for testing purposes. This
interface must allow for defining specific responses regarding the state and
configuration of compute resources, enabling comprehensive validation of the divergence
detection logic in isolation.
- Extend the test suite with new scenarios to verify the correctness of the configuration
divergence detection, including cases where resources are correctly aligned, cases where
they have diverged due to an outdated or invalid base operating system image, and
scenarios where the detection feature is explicitly disabled.
Metrics
Similarity with the Golden Diff
Each task instance compares two change sets:
- Golden: the ground-truth diff, computed as `git diff --name-status <commit>^!`
- Actual: the diff produced by the agent's changes
Each diff is parsed into a set of entries of the form:
- a (file path, operation) pair, where the operation is one of A, M, or D (add, modify, delete; renames are ignored)
Let $G$ and $A$ denote the sets of such pairs for the golden and actual diffs.
We define:
- $TP = G \cap A$ (exact matches between predicted and golden entries)
- $FP = A \setminus G$ (entries introduced by the agent but absent from the golden diff)
- $FN = G \setminus A$ (golden entries the agent failed to produce)
To obtain a normalised similarity score $S$ in $[0, 1]$, we use the Jaccard Index:

$$S = \frac{|TP|}{|TP| + |FP| + |FN|} = \frac{|G \cap A|}{|G \cup A|}$$
This score reflects agreement at the level of discrete file-operations and treats unnecessary edits and missing edits symmetrically.
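A minimal sketch of this computation, assuming both change sets have already been parsed into sets of (operation, path) entries (for example by a parser like the one sketched earlier); the handling of the empty-diff edge case is an assumption.

```python
def similarity(golden: set[tuple[str, str]], actual: set[tuple[str, str]]) -> float:
    """Jaccard index over (operation, path) entries; rename entries are dropped."""
    golden = {e for e in golden if e[0] != "R"}
    actual = {e for e in actual if e[0] != "R"}
    union = golden | actual
    if not union:
        return 1.0   # assumption: two empty change sets count as identical
    return len(golden & actual) / len(union)
```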
Acceptable Patch Rate (APR)
The Acceptable Patch Rate is a measure of a coding agent's accuracy. A task instance is counted as acceptable when its similarity score $S$ meets or exceeds the acceptance threshold. The APR is then the proportion of all task instances that meet this criterion.
Partial Patch Rate (PPR)
The Partial Patch Rate is a measure of a coding agent's consistency. A task instance is counted as partially acceptable when its similarity score $S$ meets or exceeds a lower, partial-acceptance threshold. The PPR is then the proportion of task instances that are not acceptable under the APR criterion but do meet the partially acceptable one.
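Because the numeric thresholds are not reproduced here, the sketch below takes them as parameters; `accept_threshold` and `partial_threshold` are hypothetical names, with the partial threshold assumed to be the lower of the two.

```python
def patch_rates(scores: list[float], accept_threshold: float, partial_threshold: float):
    """Compute (APR, PPR) from per-task similarity scores.

    The threshold values are parameters here because they are not reproduced
    in this document; partial_threshold < accept_threshold is assumed."""
    n = len(scores)
    acceptable = sum(1 for s in scores if s >= accept_threshold)
    partial = sum(1 for s in scores if partial_threshold <= s < accept_threshold)
    return acceptable / n, partial / n
```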
Time Utilization Score (TUS)
The Time Utilization Score is a measure of a coding agent's speed. Each task instance with a measured runtime of $t$ seconds is scored as:

$$\mathrm{TUS} = 1 - \frac{\log t}{\log T}$$

The value $T$ is the timeout, fixed at 1,200 seconds for all task instances.
- Runs that hit the timeout ($t = T$) receive $\mathrm{TUS} = 0$.
- Faster runs receive higher scores, approaching 1 as $t$ decreases.
- The logarithms capture the multiplicative nature of latency differences: for example, reducing $t$ by 50% always increases $\mathrm{TUS}$ by the same amount, $\log 2 / \log 1200 \approx 0.098$ (about 10 percentage points).
For a set of tasks, the aggregate TUS is defined as the median of the per-instance scores, which reduces sensitivity to outliers and occasional timeouts.
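A per-instance score consistent with the definition above can be sketched as follows; the clamping of very short runtimes is an assumption made to keep the score within $[0, 1]$.

```python
import math
from statistics import median

TIMEOUT_SECONDS = 1200.0   # T, as documented

def tus(runtime_seconds: float, timeout: float = TIMEOUT_SECONDS) -> float:
    """Per-instance Time Utilization Score: 0 at the timeout, higher when faster."""
    t = min(max(runtime_seconds, 1.0), timeout)   # clamp (assumption) so the score stays in [0, 1]
    return 1.0 - math.log(t) / math.log(timeout)

def aggregate_tus(runtimes: list[float]) -> float:
    """Aggregate TUS is the median of the per-instance scores."""
    return median(tus(t) for t in runtimes)
```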
Sigmabench Score (SBS)
The Sigmabench Score (often shortened to Sigmascore) is a composite measure of a coding agent's overall performance. It is defined as the geometric mean of an agent's accuracy, consistency and speed scores:

$$\mathrm{SBS} = \sqrt[3]{\mathrm{APR} \times \mathrm{PPR} \times \mathrm{TUS}}$$
The geometric mean reflects that the three metrics are orthogonal and each is independently necessary: a deficiency in any one dimension materially limits real-world usefulness. In particular:
- low APR: the agent does not produce sufficiently accurate patches
- low PPR: the agent is inconsistent, often producing totally unusable results
- low TUS: the agent disrupts developer workflow by being too slow or timing out
Because the geometric mean penalizes imbalance, an agent must perform reasonably well across all three areas to achieve a high overall score.
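As a sketch, the composite reduces to the cube root of the product of the three components; the numbers in the usage comment are made up for illustration only.

```python
def sigmabench_score(apr: float, ppr: float, tus: float) -> float:
    """Sigmabench Score: geometric mean of accuracy (APR), consistency (PPR), and speed (TUS)."""
    return (apr * ppr * tus) ** (1.0 / 3.0)

# Example (hypothetical values): sigmabench_score(0.6, 0.2, 0.5) ~= 0.39;
# the low PPR drags the composite down despite decent APR and TUS.
```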
Benchmark Procedure
Each benchmark run evaluates an agent/model combination on a curated set of tasks derived from the Sigmabench dataset. A task consists of a repository materialized in a synthetic, single-commit git history: we check out the parent commit of the selected change, then construct a synthetic git repository containing only that parent commit. This prevents contamination from the full project history and ensures that agents cannot exploit information outside of the intended context. The agent receives this synthetic repo along with the task prompt describing the required modification.
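One way the synthetic single-commit history could be built is sketched below: clone the source repository, check out the parent of the selected commit, drop the history, and re-initialize a fresh repository with a single commit. The clone-then-reinit approach, the committer identity, and the function name are assumptions rather than a description of the actual harness.

```python
import shutil
import subprocess
import tempfile

def materialize_task_repo(source_repo: str, commit_sha: str, dest: str) -> None:
    """Build a synthetic repo whose only commit is the parent of the selected change.

    Illustrative sketch; the real harness may differ in details such as
    identity configuration, ignored files, or submodule handling."""
    with tempfile.TemporaryDirectory() as tmp:
        # Export the parent commit's tree without any git history.
        subprocess.run(["git", "clone", "--quiet", source_repo, tmp], check=True)
        subprocess.run(
            ["git", "-C", tmp, "checkout", "--quiet", f"{commit_sha}^"],
            check=True,
        )
        shutil.rmtree(f"{tmp}/.git")   # drop the full project history
        shutil.copytree(tmp, dest)

    # Re-initialize a fresh history containing only this snapshot.
    subprocess.run(["git", "-C", dest, "init", "--quiet"], check=True)
    subprocess.run(["git", "-C", dest, "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", dest, "-c", "user.name=sigmabench",
         "-c", "user.email=bench@example.invalid",
         "commit", "--quiet", "-m", "Initial snapshot (parent commit)"],
        check=True,
    )
```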
Execution Environment and Version Freezing
All agents run inside an isolated container with a standardized filesystem layout and a minimal, language-appropriate toolchain. To ensure fair comparison, every benchmark run freezes the versions of all agent harnesses, CLIs, and auxiliary tooling at a specific date (the benchmark “freeze date”). This creates a stable and fair execution baseline, eliminating variability introduced by upstream agent or tool updates.
The environment allows agents to read and modify files but does not guarantee parity with each project’s intended development setup. In order to prevent agents from wasting time installing missing software, the following prompt preamble is prepended to the task prompt:
<execution-environment>
You are running inside a locked-down Docker container with a fixed set of development
tools. A limited selection of compilers, linters, and language runtimes is available, but
some tools may be missing. Use only the tools that are present. Do not attempt to install,
download, or fetch additional software or dependencies; such operations will fail.
Complete the task using only the capabilities available in this environment.
</execution-environment>
Agent Interaction Model and Reasoning Effort
Agents operate in iterative cycles in which they may inspect files, run tools, access the internet, and modify files. Each task provides the agent with a fixed time and compute budget.
Sigmabench does not override or normalize model-level reasoning modes. Instead, the benchmark uses the default reasoning effort specified by each agent harness (e.g., medium for Codex CLI with GPT‑5.1-Codex-Max). These defaults are frozen and reported so that users can interpret results in the context of the agent’s native configuration.
The agent's final output is the diff it induces via file modifications; it is captured with `git diff --name-status` at task completion or timeout.
Scoring and Outcome Recording
For each task, the benchmark records:
- the diff produced by the agent,
- timing information,
- an exit condition (success, error, or timeout),
- diagnostic logs relevant to agent behavior.
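A per-task record along these lines might look as follows; the field names and types are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    """Per-task record captured at the end of a run (illustrative field names)."""
    task_id: str
    agent_diff: set[tuple[str, str]]   # parsed (operation, path) entries from `git diff --name-status`
    runtime_seconds: float
    exit_condition: str                # "success", "error", or "timeout"
    logs: list[str] = field(default_factory=list)
```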
Statistical Methods
Sigmabench quantifies uncertainty in all metrics using nonparametric bootstrap resampling. For each agent/model combination, we draw 5,000 bootstrap samples, each consisting of a resampling (with replacement) of the full set of tasks. For each resample we recompute the metric, producing a distribution of 5,000 bootstrap estimates. The 95% confidence interval is taken as the 2.5th and 97.5th percentiles of this empirical distribution. This approach makes no assumptions about underlying metric distributions.
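A percentile-bootstrap sketch along these lines is shown below; the metric is passed as a callable, and the seed, names, and return structure are illustrative.

```python
import numpy as np

def bootstrap_ci(values, metric, n_resamples: int = 5000, seed: int = 0):
    """Percentile bootstrap 95% CI for a metric computed over per-task values.

    `metric` maps a 1-D array of per-task values (e.g. similarity scores or
    per-instance TUS values) to a scalar; the fixed seed is illustrative."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        resample = rng.choice(values, size=len(values), replace=True)
        estimates[i] = metric(resample)
    lower, upper = np.percentile(estimates, [2.5, 97.5])
    return estimates.mean(), (lower, upper)   # bootstrap mean (midpoint) and 95% CI
```

For example, passing per-instance TUS values together with `np.median` as the metric would yield a midpoint and interval for the aggregate TUS.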
Agent ranking requires an additional rule to balance statistical caution with practical discriminability. Sigmabench assigns ranks by comparing the midpoint (bootstrap mean) of one agent's metric with the upper bound of the 95% confidence interval of another. Agent A is considered strictly better than Agent B if A's midpoint exceeds B's upper bound. If neither agent is strictly better under this rule, the two are assigned the same rank. This relaxed dominance criterion avoids over-interpreting small differences while still surfacing meaningful performance separation where the data supports it.
To improve interpretability, Sigmabench uses both numerical ranks and tiers. Numerical ranks reflect strict ordering under the dominance criterion described above: agents deemed indistinguishable are assigned the same rank, and subsequent ranks are incremented by the size of the tied group (e.g., 1, 1, 3). This preserves the numerical position implied by the comparisons. Tiers, by contrast, group agents by statistical indistinguishability without gaps in the numbering: tied agents share the same tier, and tier numbers increase sequentially regardless of group size (e.g., 1, 1, 2). Ranks emphasize relative position, while tiers emphasize practical equivalence. For these reasons, Sigmabench emphasizes tiers over ranks in most cases.
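The rank and tier assignment can be sketched as follows, given each agent's bootstrap midpoint and CI upper bound. Grouping agents by comparing the current group's leader against each subsequent agent is one possible reading of the tie rule; the function name and data layout are illustrative.

```python
def assign_ranks_and_tiers(results: dict[str, tuple[float, float]]):
    """Assign ranks and tiers from per-agent (midpoint, ci_upper) pairs.

    Agent A strictly beats agent B when A's midpoint exceeds B's 95% CI upper
    bound; agents not separated by this rule share a rank and a tier."""
    # Best midpoint first.
    ordered = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)

    # Group consecutive agents that the group's leader does not strictly beat.
    groups, current = [], [ordered[0]]
    for name, (mid, upper) in ordered[1:]:
        leader_mid = current[0][1][0]
        if leader_mid > upper:                    # strictly better: start a new group
            groups.append(current)
            current = [(name, (mid, upper))]
        else:                                     # indistinguishable: same group
            current.append((name, (mid, upper)))
    groups.append(current)

    ranks, tiers, position = {}, {}, 1
    for tier, group in enumerate(groups, start=1):
        for name, _ in group:
            ranks[name] = position                # e.g. 1, 1, 3 for a two-way tie at the top
            tiers[name] = tier                    # e.g. 1, 1, 2 for the same situation
        position += len(group)
    return ranks, tiers
```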
Limitations and Notes
Sigmabench is designed to provide a consistent and reproducible assessment of agentic coding performance, but several limitations are inherent to our methodology.
LLM-generated task prompts. Task prompts are produced by an LLM via reverse-engineering from the golden diff rather than being human-authored. This introduces a degree of noise, since prompt phrasing and emphasis may differ from how a human would frame the same task. The approach ensures procedural consistency across all commits, but it also means that prompts are more synthetic than those seen in real workflows.
File-level scoring rather than code-content scoring. The benchmark evaluates whether the agent applies the correct operations to the correct set of files rather than whether the resulting code compiles, runs, or matches the exact patch contents. This aligns the metric with structural correctness—i.e., whether the agent understood where the work belongs—rather than line-level accuracy. However, it also means Sigmabench does not detect semantic or syntactic correctness issues within files.
Uniform, non-specialized toolchains. Agents operate inside a standardized container that provides a minimal toolchain covering the major languages, but it is not tailored to the exact build, linting, or testing environment of each individual repository. As a result, some agents may struggle with missing tools that would normally be present in a real project setup. Because this constraint is uniform across all agents, we expect its impact on relative rankings to be limited.
Open-source–only dataset. Sigmabench is constructed exclusively from publicly available open-source repositories. These projects differ in meaningful ways from the broader population of proprietary or commercial codebases: they tend to have clearer modular boundaries, more consistent formatting, stronger linting or CI conventions, and more frequent small-scope commits. Conversely, many industry codebases contain legacy components, inconsistent style, monolithic modules, or domain-specific tooling not well represented in open source.
Unrepresented programming languages. Several widely used programming languages do not appear in Sigmabench v1. This is not due to lack of interest, but rather to the difficulty of finding open-source repositories that satisfy all dataset criteria across the required size categories. In practice, languages such as PHP, the broader .NET ecosystem (C#, F#, VB.NET), and C/C++ rarely met the combination of size, commit-structure, and Conventional Commit usage needed for inclusion. Their absence means that benchmark results may not generalize to codebases primarily written in these languages.
Lack of interactive agent evaluation. Many coding agents support interactive workflows in which users iteratively provide feedback, clarify requirements, or guide the agent through multi-step reasoning. Sigmabench v1 does not evaluate this mode of use. All benchmark runs follow a fixed, non-interactive pattern: a single task prompt, one autonomous execution loop, and a final output. This setup allows for consistent comparison across agents, but it does not capture the potential performance differences that may emerge during interactive, human-in-the-loop sessions.
Restriction to CLI-based agents. Sigmabench v1 restricts evaluation to command-line–based coding agents. This decision was driven primarily by practical considerations: CLI agents allow for fully automated, reproducible execution in a controlled environment, whereas graphical or browser-based agents require additional setup, state management, and interaction scaffolding that were deemed out of scope for the first version of the benchmark. As a result, Sigmabench v1 does not capture performance characteristics specific to GUI-driven workflows. Future versions of the benchmark may expand coverage to include non-CLI agents as the methodology and infrastructure evolve.