Designing Self-evolving Coding Agents for Terminal-Bench 2.0: A Systematic Exploration
Abstract. We present a systematic exploration of designing self-evolving coding agents for Terminal-Bench 2.0, a benchmark of 89 diverse command-line tasks. Starting from a vanilla ReAct agent powered by Claude Opus 4.6 (65.91%), we conduct large-scale controlled experiments across five groups: establishing baselines, analyzing the performance ceiling (pass@5: 85.88%), exploring four harness components (prompt, memory, tools, skills), refining skill evolution strategies, and developing self-verification mechanisms. Our investigation reveals that among all mutable components, skill evolution delivers the most consistent improvement, and a nudged self-verification mechanism boosts performance further. Altogether, our best self-evolution algorithm achieves 76.47%, a +10.56pp improvement over the Opus 4.6 vanilla baseline, ranking #7 on the Terminal-Bench 2.0 leaderboard.
1. Introduction
1.1 Terminal-Bench 2.0
Terminal-Bench 2.0 (TB2) is a benchmark of 89 diverse command-line tasks spanning system administration, scientific computing, security, data science, and software engineering (e.g., configuring web servers, analyzing Bayesian networks, training ML models, sanitizing git history, and compiling software from source). Each task provides a Docker container with a pre-configured environment, a natural language task description, and a hidden test script that validates the solution. Task difficulty ranges widely, from 5-minute file edits to 60-minute multi-step system configurations, with per-task timeouts.
1.2 Agent Architecture
Our solver agent (i.e., the coding agent) employs two agent frameworks: (1) Strands Agents + AgentCore; (2) a standard ReAct loop + Sandbox/Docker. The solver has access to three core tools:
- bash(): execute shell commands in the Docker container (60s timeout per call)
- python(): execute Python code in an isolated interpreter
- submit("DONE"): signal task completion
The system prompt follows an Analysis/Plan format that structures the agent’s reasoning before each tool call.
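The loop described above can be sketched as follows. This is a minimal illustrative reconstruction, not our actual implementation: `call_model`, the tuple-based history format, and the turn budget are assumptions.

```python
import subprocess

# Minimal sketch of the solver's ReAct loop. Tool names match the text;
# call_model, the history format, and max_turns are illustrative assumptions.
TOOLS = {
    "bash": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=60
    ).stdout,
    "python": lambda code: str(eval(code)),  # stand-in for an isolated interpreter
}

def react_loop(call_model, task, max_turns=100):
    """Run Analysis/Plan turns until the model calls submit."""
    history = [("task", task)]
    for _ in range(max_turns):
        tool, arg = call_model(history)      # model picks the next tool call
        if tool == "submit":
            return arg                        # e.g. "DONE"
        observation = TOOLS[tool](arg)
        history.append((tool, observation))   # observation feeds the next turn
    return None  # turn budget exhausted without submission
```

Each tool result is appended to the history, so the model reasons over the full trajectory before its next call.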
1.3 The Evolution Framework (A-Evolve)
A-Evolve deploys a trajectory-only, in-situ evolution loop as follows:
```text
Solver Agent (bash/python/submit tools)
  ↓ solves tasks in Docker containers
Trajectory Observer (JSONL logs)
  ↓ records tool calls, outputs, ...
LLM Judge (scores 0-10 without ground truth)
  ↓ evaluates trajectories / identifies potential failure patterns / ...
Evolver Agent (LLM agent with a bash tool to modify the agent harness)
  ↓ mutates prompts/skills/tools/memories based on observations
Skill Library (lazy-loading) / Prompt / Tool Registry / Memory
  ↓ agent reads skills on demand / loads updated prompts / ...
Next Batch's Solver
```
The key design constraint is that evolution operates without ground-truth feedback. The evolver never sees test results — it must infer what went wrong from the trajectory alone. This makes the problem significantly harder than test-guided optimization.
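One cycle of this loop can be sketched as follows. It is a hypothetical outline under the constraint above (the judge and evolver see only trajectories and judge scores, never test results); all function names are illustrative.

```python
# Sketch of one A-Evolve cycle. The evolver receives trajectories and judge
# scores only -- no pass/fail signal -- and returns a mutated harness.
def evolve_cycle(tasks, solve, observe, judge, evolve, workspace):
    trajectories = []
    for task in tasks:
        traj = observe(solve(task, workspace))  # JSONL-style log of tool calls/outputs
        traj["judge_score"] = judge(traj)       # 0-10, assigned without ground truth
        trajectories.append(traj)
    # Mutate skills/prompt/tools/memory for the next batch's solver.
    return evolve(trajectories, workspace)
```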
1.4 Experiment Metrics
For the solver agent, we measure: (1) pass rate (resolved/total, %); (2) average turns; (3) average tool calls; (4) average number of skills used; (5) running time. For the evolver agent, we measure: (1) number of created skills/tools/memories; (2) characters changed in the prompt/memory.
1.5 Experiment Notation
We conduct our experiments in a principled way to systematically build a self-evolving coding agent, organizing them into the following chronological, thematic groups. Each group is labeled with a letter prefix, and individual experiments within a group are numbered sequentially:
| Group | Focus | Experiments | Section |
|---|---|---|---|
| A | Baseline establishment | A1–A2 | Section 2 |
| B | Ceiling analysis and task classification | B1a–B1c, B2, B3 | Section 3 |
| C | Harness component exploration (prompt, memory, tools, skills, combinations) | C1a–C1b, C2a–C2b, C3a–C3c, C4a–C4b, C5a–C5g | Section 4 |
| D | Focused skill evolution strategies (size, count, generation, filtering, variance) | D1a–D1g, D2a–D2d, D3a–D3d, D4a–D4d, D5a–D5e | Section 5 |
| E | Self-verification test-time scaling | E1a–E1f, E2a–E2c, E3a–E3c | Section 6 |
1.6 Infrastructure: API reliability and context management
Our agent infrastructure addresses two key challenges: API reliability and context management. For API reliability, running 20 concurrent AWS Bedrock API calls requires robust error handling — we implement transient error retries (covering read timeouts, content filtering false positives, and connection hangs), timeout clock pausing during retry waits so that idle retry time does not consume the task’s budget, and a wall-clock safety cap to prevent infinite retry loops on persistent outages. For context management, we employ a summarization-based approach: when the conversation reaches the context window limit, we summarize the preceding conversation history into a condensed representation and continue from that summary, preserving critical state while freeing context capacity for continued reasoning.
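The retry behavior described above can be sketched as follows. This is a simplified illustration (the exception type, backoff schedule, and parameter names are assumptions): sleep time between retries is excluded from the task's time budget, while a wall-clock cap bounds total retrying.

```python
import time

class TransientError(Exception):
    """Stand-in for retryable API failures (read timeouts, connection hangs, ...)."""

def call_with_retries(fn, wait=1.0, backoff=2.0, max_retry_wall=600):
    """Return (result, active_seconds). Only time spent inside fn is charged
    to the task budget; sleep between retries is not."""
    active, retry_t0 = 0.0, time.monotonic()
    while True:
        start = time.monotonic()
        try:
            return fn(), active + (time.monotonic() - start)
        except TransientError:
            active += time.monotonic() - start      # failed call time still counts
            if time.monotonic() - retry_t0 > max_retry_wall:
                raise                               # safety cap on persistent outages
            time.sleep(wait)                        # paused clock: not charged
            wait = min(wait * backoff, 30.0)        # exponential backoff, capped
```

The caller subtracts only `active_seconds` from the task's remaining budget, so transient outages do not starve the agent of working time.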
2. Establishing the Baseline (Group A)
2.1 Opus 4.5 Baseline (A1)
Our initial experiments used Claude Opus 4.5. The vanilla ReAct coding agent scored an average of 60.15% across 5 runs. (For simplicity, all results are reported as the average of 5 runs by default; some use 3 runs due to API quota limits.)
| Metric | A1: Opus 4.5 |
|---|---|
| Score | AVG 60.15% |
| Mean turns (all tasks) | 72.84 |
| Mean input tokens/task | 1,081,840 |
| Mean output tokens/task | 15,689 |
| Prompt length | 1,368 chars |
| Skills loaded | 0 (no loading mechanism) |
2.2 Opus 4.6 Baseline (A2)
When Claude Opus 4.6 was released (February 5, 2026), we used the same ReAct workflow and observed a significant capability jump:
| Metric | A1: Opus 4.5 | A2: Opus 4.6 | Change |
|---|---|---|---|
| Score | AVG 60.15% | AVG 65.91% | +5.76pp |
| Mean turns (all) | 72.84 | 60.68 | -16.70% |
| Mean input tokens/task | 1,081,840 | 670,784 | -38.01% |
| Mean output tokens/task | 15,689 | 13,910 | -11.34% |
| Prompt length | 1,368 chars | 1,368 chars | Same |
| Skills loaded | 0 | 0 | Same |
Finding: Opus 4.6 solves tasks with 38.01% fewer input tokens and 16.70% fewer turns while achieving a higher pass rate; the model is substantially more efficient and capable. In particular, we observe that Opus 4.6 tends to self-verify its results automatically.
3. Ceiling Analysis and Task Classification (Group B)
Before attempting to improve performance, we needed to (1) estimate the upper bound on performance, by computing pass@5 from A2 (B1a), increasing the per-task time limit (B1b), and providing an environment signal in the ReAct loop that compares the generated artifact against the ground-truth result (B1c); and (2) understand the benchmark's structure: which tasks are deterministically solvable, which are beyond the model's capability, and which are stochastic. We then focus on the stochastic, flaky tasks for improvement.
3.1 Ceiling Analysis (B1)
| Experiment | Design details | Performance |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs (upper bound of flaky stabilization) | 85.88% |
| B1b | 2× per-task timeout (1,800s–7,200s) | 68.12% |
| B1c | Environment signal: compare generated artifact against ground-truth output in the ReAct loop | 66.23% |
Finding: B1a gives the north star guiding our Terminal-Bench 2.0 exploration (85.88%). B1c shows that environment signals do not help much on these coding tasks. B1b shows that doubling the running time helps by ~2pp (68.12% vs 65.91%), indicating TB2 is a resource-heavy benchmark.
3.2 Task Stability Classification (B2)
Across 5 independent Opus 4.6 runs, we classified all 89 tasks by their pass rate:
| Category | Count | Pass Rate | Description |
|---|---|---|---|
| Always pass | 52 | 100.00% | Deterministic — model capability sufficient |
| Flaky (unstable) | 11 | 40–80% (pass 2–4 of 5 runs) | Likely solvable with small improvements |
| Flaky (rare pass) | 10 | 20% (pass 1 of 5 runs) | Hard but occasionally solvable |
| Always fail | 16 | 0.00% | Beyond current capability |
Finding for the theoretical ceiling: If all flaky tasks could be stabilized, the score would reach 85.88%. The 16 always-fail tasks represent a hard capability boundary.
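This classification and the pass@5 ceiling can be reproduced mechanically from a per-task pass/fail matrix. The sketch below assumes a simple dict of five boolean results per task; the input format is our assumption and the values in the test are illustrative.

```python
# Classify tasks by pass count across runs and compute pass@k as the union
# of successes across the independent runs.
def classify(runs):
    buckets = {"always_pass": [], "flaky_unstable": [], "flaky_rare": [], "always_fail": []}
    for task, results in runs.items():
        k = sum(results)
        if k == len(results):
            buckets["always_pass"].append(task)     # deterministic successes
        elif k == 0:
            buckets["always_fail"].append(task)     # capability boundary
        elif k == 1:
            buckets["flaky_rare"].append(task)      # pass 1 of 5
        else:
            buckets["flaky_unstable"].append(task)  # pass 2-4 of 5
    pass_at_k = sum(any(r) for r in runs.values()) / len(runs)
    return buckets, pass_at_k
```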
3.3 Failure Pattern Taxonomy (B3)
Deep trajectory analysis, comparing successful and failed runs of the same task side by side, revealed four distinct failure patterns:
| Pattern | Frequency | Description | Challenge |
|---|---|---|---|
| Premature convergence | High | Model locks into a wrong approach early and never backtracks | Moderate |
| Semantic reasoning error | Medium | Model misreads task requirements | Moderate |
| Hidden eval constraints | Medium | Task has implicit requirements not in the description | Low |
| Algorithm stochasticity | High | Equally plausible approaches differ in robustness | High |
These failure patterns help guide our next-stage experiment design for harness improvement (i.e., workflow, prompt, memory, tool, and skill).
4. Exploring Improvement Directions (Group C)
With the baseline, north star, and flaky tasks identified, we systematically explored four workspace components: tools, memory, system prompt, and skills.
4.1 System Prompt: Behavioral Strategy Evolution (C1)
We restricted the evolver to modifying only the system prompt to address the failure patterns from B3, e.g., adding behavioral strategy rules such as "Comprehensively understand the task and consider various directions before starting to solve the problem."
| Experiment | Approach | Score | Prompt Length |
|---|---|---|---|
| C1a | Prompt-only evolution with lightweight guidance | 66.02% | 1,368 -> 1,940 chars |
| C1b | Prompt-only evolution with tailored guidance from B3 | 60.13% | 1,368 -> 4,940 chars |
Finding: C1a scored essentially the same as vanilla A2 (+0.11pp), while the longer, more complex prompt in C1b hurt substantially. The strategy rules are qualitatively sensible but quantitatively inert: Opus 4.6 already implements these strategies implicitly.
4.2 Memory: Episodic Learning Across Tasks (C2)
We tested memory injection: episodic observations from previous tasks, fully loaded into the context of future tasks. Memories accumulated across evolution cycles without providing task-relevant information.
| Experiment | Approach | Score | Memory Length |
|---|---|---|---|
| C2a | Memory-only evolution with memory budgets | 65.11% | 1,431 chars |
| C2b | Memory-only evolution without limitations | 60.46% | 2,770 chars |
Finding: Memory injection was consistently neutral or slightly negative. The evolver stored generic process observations (“always submit your answer,” “save intermediate results”) — advice the model already knows.
4.3 Tools: Universal vs Task-Specific (C3)
We tested the impact of pre-defined tools and self-generated tools. The pre-defined tool set is Bash, Python, and Submit; in the self-evolving stage, the evolver agent generates task-specific tools from task trajectories.
| Experiment | Approach | Score | Notes |
|---|---|---|---|
| C3a | Bash + Python | 58.12% | |
| C3b | Bash + Python + Submit | 65.91% | the same setting as A2 |
| C3c | Bash + Python + Submit + Self-generated tools (e.g., SecurityFileCheck(), Verify()) | 59.37% |
Finding: (1) The Submit tool helps because the agent otherwise does not know when to stop; in C3a trajectories, the agent sometimes continues working on the problem without ever submitting results. (2) Self-generated task-specific tools hurt on terminal/coding tasks: similar tasks were heavily trained in post-training stages using only universal tools (e.g., grep, bash). (3) Verification logic needs to be richer than what a tool interface naturally supports; the agent often called verify() but performed only superficial checks.
4.4 Skills: The Most Promising Direction (C4)
We examine the influence of self-generated skills during evolution. Skills serve as domain-specific knowledge and provide guidance when the agent solves similar tasks; they are loaded lazily.
| Experiment | Approach | Score | Skill length |
|---|---|---|---|
| C4a | Skill-only evolution with skill count budgets | 70.58% | 3,012 characters |
| C4b | Skill-only evolution without budgets | 68.23% | 8,122 characters |
Finding: (1) In C4a, after the evolver generated 3 seed skills, the agent chose to read at least one skill on 32.94% of the remaining tasks. (2) Even without a budget (C4b), skills still deliver a gain over the A2 baseline, though smaller than C4a. Unlike tool, memory, or prompt changes, well-targeted skills produce a measurable improvement.
4.5 Auto-Harness: The combination of prompt, tool, memory, and skill (C5)
Finally, we test combinations of harness components. From C1–C4 we know that skills help most and that lazy loading does not hurt performance much, so we explore different harness combinations on top of skill evolution.
| Experiment | Approach | Score | Note |
|---|---|---|---|
| C5a | Skill + Prompt | 67.11% | |
| C5b | Skill + Tool | 66.23% | |
| C5c | Skill + Memory | 67.02% | |
| C5d | Skill + Prompt + Tool | 69.13% | |
| C5e | Skill + Prompt + Memory | 67.29% | |
| C5f | Skill + Tool + Memory | 64.19% | |
| C5g | Skill + Prompt + Tool + Memory | 65.19% | |
Finding: Across the C5 series, adding prompt, tool, or memory evolution on top of skills hurts in every combination except C5d (Skill + Prompt + Tool). This indicates that, for Terminal-Bench 2.0, we should focus on skill evolution to approach the north star established in B1.
5. Skill Evolution Strategies (Group D)
Skills use a lazy-loading strategy: we list only skill names and one-line descriptions in the system prompt, and the agent calls read_skill(name) to load full content on demand.
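A minimal sketch of this mechanism follows; the skill registry and its contents are hypothetical examples, not our actual skill library.

```python
# Hypothetical skill registry: one-line summaries go in the system prompt,
# full bodies are fetched only via read_skill.
SKILLS = {
    "scientific-computing": {
        "summary": "Numerical pitfalls: normalization choices, KL divergence",
        "body": "(full 1,500-2,000 character skill content would go here)",
    },
}

def skill_index():
    """One line per skill for the system prompt: name + one-line description."""
    return "\n".join(f"- {name}: {s['summary']}" for name, s in sorted(SKILLS.items()))

def read_skill(name):
    """Tool the agent calls to load a skill's full content on demand."""
    skill = SKILLS.get(name)
    return skill["body"] if skill else f"Unknown skill: {name}"
```

Only the index is always in context; full skill bodies cost tokens only when the model decides they are relevant.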
5.1 Skill Size Optimization (D1)
We examine the impact of skill size:
| Experiment | Approach | Score | Note |
|---|---|---|---|
| D1a | 0–500 chars | 67.21% | |
| D1b | 500–1,000 chars | 66.73% | |
| D1c | 1,000–1,500 chars | 68.91% | |
| D1d | 1,500–2,000 chars | 71.23% | |
| D1e | 2,000–2,500 chars | 70.02% | |
| D1f | 2,500–3,000 chars | 68.43% | |
| D1g | 3,000–3,500 chars | 67.19% | |
Finding: Skill size exhibits a clear sweet spot at 1,500–2,000 characters (D1d, 71.23%). Skills shorter than 1,000 characters lack sufficient domain detail to be useful, while skills exceeding 2,500 characters dilute the signal with redundant content. The drop-off is roughly symmetric: -4.02pp below the sweet spot (D1a) and -4.04pp above it (D1g). This guides our skill creation constraints in subsequent experiments.
5.2 Skill count budget and Per-Task Keyword Selection (D2)
We investigate the number of skills added. We also test a mitigation for the unbounded setting: match skill keywords against the task description and inject only the top-k skills ranked by n-gram overlap relevance.
| Experiment | Skill Delivery | Score |
|---|---|---|
| D2a | Skill count budget of 3 | 70.58% |
| D2b | Skill count budget of 6 | 72.21% |
| D2c | No skill count budget | 68.23% |
| D2d | No budget + per-task keyword selection | 71.94% |
Finding: A moderate skill budget of 6 (D2b) achieves the best score at 72.21%, outperforming both the tighter budget of 3 (D2a, 70.58%, with ratio of tasks using skills as 32.94%) and the unbounded setting (D2c, 68.23%). Unbounded skill injection degrades performance because the agent loads irrelevant skills that distract from the task. Per-task keyword selection (D2d, 71.94%) partially mitigates the unbounded problem by filtering skills for relevance, recovering most of the gain without a hard count limit.
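The per-task keyword selection in D2d might look like the following sketch. The report specifies only "top-k by n-gram overlap," so the bigram scoring and tie-breaking below are assumptions.

```python
# Rank skills by n-gram overlap with the task description and keep the top-k
# (bigrams here; the exact n and scoring are assumptions).
def ngrams(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def select_skills(task_description, skills, k=6, n=2):
    """skills maps name -> description; return top-k names with nonzero overlap."""
    task_grams = ngrams(task_description, n)
    scored = [(len(task_grams & ngrams(desc, n)), name) for name, desc in skills.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))   # highest overlap first, name as tie-break
    return [name for score, name in scored[:k] if score > 0]
```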
5.3 Solver-Proposed Skills (D3)
We tested having the solver (which has full environment context) propose skill drafts after each task, with the evolver judging and curating the proposals.
| Experiment | Skill Generation | Score | Note |
|---|---|---|---|
| D3a | Evolver creates from trajectories | 72.21% | |
| D3b | Solver creates, evolver refines | 67.35% | |
| D3c | Solver creates, evolver generalizes | 69.16% | |
| D3d | Evolver creates, then judges and refines | 70.37% | |
Finding: “Solver creates, evolver refines” (D3b) has the lowest score. The solver excels at extracting task-specific knowledge but fails at generalization. The evolver merged hyper-specific drafts into existing skills, replacing useful content with narrow tips like “ICS calendar parsing” and “OCR configuration.” “Solver creates, evolver generalizes” (D3c) partially recovered by instructing the evolver to “never shrink” existing skills and to generalize across them. However, the best result still comes from the evolver creating skills directly from trajectories (D3a). Lesson: The bottleneck is not knowledge extraction but knowledge generalization — the solver cannot reliably identify what transfers to unseen tasks.
5.4 Case study: The Process-Advice Trap (D4)
When we analyzed long-horizon trajectories and read the skill content from the D1–D3 experiments, we found that the evolver gravitates toward generic process advice. Without constraints, it produces skills like "always check if packages are installed" and "handle timeouts by trying alternatives." We introduced a FORBIDDEN/REQUIRED blocklist as follows:
```text
FORBIDDEN: Timeout handling, pip tips, session persistence, generic debugging
REQUIRED: Domain-specific libraries, verification steps, domain pitfalls ONLY
```
| Experiment | Skill guidance | Score | Note |
|---|---|---|---|
| D4a | Generic skill creation guidance in evolver agent | 72.21% | |
| D4b | + negative prompting | 73.17% | |
| D4c | + positive steering | 70.36% | |
| D4d | + negative prompting + positive steering | 73.21% |
Finding: Negative prompting (D4b, blocking bad outputs) was more effective than positive steering (D4c). The combination (D4d) achieves the highest score but only marginally above negative prompting alone (+0.04pp), suggesting most of the gain comes from the blocklist. Without the blocklist, skills accumulate generic process advice; with it, they stay focused on domain-specific knowledge.
5.5 Case Study: Skill Evolution Non-Determinism (D5)
Similar to the non-determinism in LLM response generation and agent execution, we examine non-determinism in skill evolution: we ran five evolution runs from an identical seed. The results are shown below.
| Experiment | Example debugging-skill content | Scientific-computing skill content | Score |
|---|---|---|---|
| D5a | Protobuf API migration | ML GPU detection | 69.65% |
| D5b | ICS file writing | OCR/tesseract | 71.30% |
| D5c | constraint solvers | KL divergence | 72.21% |
| D5d | C/C++ memory debugging (valgrind) | calendar scheduling | 73.12% |
| D5e | async cancellation | logic circuits | 74.77% |
Finding: Five evolution runs from identical seeds produced substantially different skill content — each evolver wrote a different “textbook” based on which tasks happened to fail in each batch. Despite the content divergence, scores ranged from 69.65% to 74.77%, indicating that multiple distinct skill sets can be effective. This implies that evolved skills are samples from a large space of possibly-useful knowledge, heavily path-dependent on batch ordering and stochastic task outcomes. Ensembling diverse skill sets may be a promising future direction.
6. Self-Verification: Breaking the Ceiling (Group E)
For motivation, our error analysis (Section 3.3) revealed that many flaky task failures stem from unchecked assumptions rather than missing knowledge. The agent solves the task correctly most of the time but occasionally submits without verifying a critical decision — wrong normalization, broken package manager, overfitting to the example.
6.1 Ablation Study: Various Verification Approaches (E1)
We tested five verification delivery mechanisms, plus a no-verification baseline, on the 21 flaky tasks (Opus 4.6, no evolution):
| Experiment | Approach | Score | Change | Avg Turns | Total Tokens |
|---|---|---|---|---|---|
| E1a | Baseline (no verification) | 33.33% | — | 77.8 | 24.34M |
| E1b | Two-step verification-submit workflow | 42.86% | +9.52pp | 78.1 | 25.34M |
| E1c | Optional verify() tool | 47.62% | +14.29pp | 91.0 | 46.46M |
| E1d | Self-verification skill | 66.67% | +33.33pp | 82.0 | 32.29M |
| E1e | Enhanced system prompt | 52.38% | +19.05pp | 69.5 | 18.87M |
| E1f | Skill + prompt combo | 42.86% | +9.52pp | 71.2 | 22.05M |
The self-verification skill (E1d) was the clear winner. Critically, the combination of skill + prompt (E1f) performed worse than either alone — a counterintuitive result with an important explanation.
Why the Combination Hurts
The skill works because it’s lazy-loaded: the model reads it only when uncertain. The prompt instruction forces verification on every task, including those where the model is already confident. When both are present:
- The prompt forces a verification step the model wouldn’t naturally take
- The model also loads the skill (because it’s now in verification mode)
- Redundant double-verification wastes 2-3 turns per task
- The aggregate overhead costs more tasks than extra checking saves
Lesson: Let the model self-select when to verify rather than forcing it universally.
6.2 Scaling to Full Benchmark: The Initial Disappointment (E2)
The ablation showed the gain from the self-verification skill on the 21 flaky tasks. We then ran the same configuration on all 89 tasks with various self-verification seed skills.
| Experiment | Skill setup | Score | Note |
|---|---|---|---|
| E2a | No predefined seed skills | 72.21% | |
| E2b | Generic self-verification skill as seed skill | 70.16% | |
| E2c | Curated self-verification skill as seed skill | 74.27% |
Finding: Without any predefined skills (E2a), the baseline already achieves 72.21%. Adding a generic self-verification skill (E2b) actually hurts by 2.05pp — the skill contained advice like “check edge cases,” “test empty input,” “verify permissions,” which is knowledge the model already has (the same process-advice trap from Section 5.4). Only when we curated the skill with targeted assumption-checking content (E2c) did verification deliver a +2.06pp gain. This confirms that what the skill says matters far more than that a skill exists.
Guided by failure pattern analysis (Section 3.3), we replaced the generic advice with targeted assumption-checking:
Before (generic):
```text
## 3. Edge Cases
- Check with empty input, special characters, large input
- Verify error handling works
- Check permissions (chmod, ownership)
```
After (targeted):
```text
## 3. Check Your Assumptions
- If you chose between multiple approaches (e.g., normalization methods),
  verify your choice matches what the task/test expects
- If you installed or modified system packages, verify the package manager
  still works
- If your solution works on the provided example, consider whether it
  generalizes to different inputs
```
Each bullet addresses a real failure mode:
- Normalization choice: pytorch-model-cli failure (chose MNIST mean/std instead of raw /255)
- Package manager state: dna-assembly failure (apt-get timeout corrupted dpkg)
- Generalization: video-processing failure (MOG2 overfit to example video)
We also enhanced the server verification: “curl or wget to test endpoints — verify response content, not just that the port is open. Checking ports with ss or netstat is NOT enough.” This addresses install-windows-3.11 where the agent verified port 80 was open but never tested nginx content.
6.3 The Nudge: Directing the Model to Read the Skill (E3)
With improved content, we added a single line to the system prompt: “After you think you have completed the task, read the self-verification skill to verify your solution.”
| Experiment | Nudge | Skill Content | Skill Loads | Score |
|---|---|---|---|---|
| E3a | None | Targeted | 24.71% | 74.27% |
| E3b | Verification nudge | Generic | 77.65% | 68.24% |
| E3c | Verification nudge | Targeted | 90.59% | 76.47% |
Finding: The system prompt nudge dramatically increases skill loading frequency — from 24.71% (E3a) to 77.65%–90.59% (E3b/E3c). However, loading frequency alone does not guarantee improvement: E3b loads the skill on 77.65% of tasks yet drops performance by 6.03pp because the generic content wastes turns on checks the model already performs. E3c, with targeted content and 90.59% loading, achieves the best result at 76.47%. Lesson: Content quality matters more than loading frequency. The improvement comes from making each verification round productive (catches real bugs) rather than wasteful (checks things the model already knows).
7. Results Summary
7.1 Full Score Progression
Below we present all experiments organized by group. Detailed analysis for each experiment is in its respective section.
Group A: Baselines (Section 2)
| Experiment | Description | Score |
|---|---|---|
| A1 | Opus 4.5 vanilla ReAct agent | 60.15% |
| A2 | Opus 4.6 vanilla ReAct agent | 65.91% |
Group B: Ceiling Analysis (Section 3)
| Experiment | Description | Score |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs | 85.88% |
| B1b | 2× per-task timeout | 68.12% |
| B1c | Environment signal (artifact vs ground-truth) | 66.23% |
Group C: Harness Component Exploration (Section 4)
| Experiment | Description | Score |
|---|---|---|
| C1a | Prompt-only evolution (lightweight) | 66.02% |
| C1b | Prompt-only evolution (tailored from B3) | 60.13% |
| C2a | Memory-only evolution (with budgets) | 65.11% |
| C2b | Memory-only evolution (unbounded) | 60.46% |
| C3a | Bash + Python (no Submit tool) | 58.12% |
| C3b | Bash + Python + Submit (= A2) | 65.91% |
| C3c | + Self-generated tools | 59.37% |
| C4a | Skill-only evolution (budgeted) | 70.58% |
| C4b | Skill-only evolution (unbounded) | 68.23% |
| C5a | Skill + Prompt | 67.11% |
| C5b | Skill + Tool | 66.23% |
| C5c | Skill + Memory | 67.02% |
| C5d | Skill + Prompt + Tool | 69.13% |
| C5e | Skill + Prompt + Memory | 67.29% |
| C5f | Skill + Tool + Memory | 64.19% |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |
Group D: Skill Evolution Strategies (Section 5)
| Experiment | Description | Score |
|---|---|---|
| D1a | Skill size 0–500 chars | 67.21% |
| D1b | Skill size 500–1,000 chars | 66.73% |
| D1c | Skill size 1,000–1,500 chars | 68.91% |
| D1d | Skill size 1,500–2,000 chars | 71.23% |
| D1e | Skill size 2,000–2,500 chars | 70.02% |
| D1f | Skill size 2,500–3,000 chars | 68.43% |
| D1g | Skill size 3,000–3,500 chars | 67.19% |
| D2a | Skill count budget = 3 | 70.58% |
| D2b | Skill count budget = 6 | 72.21% |
| D2c | No skill count budget | 68.23% |
| D2d | No budget + per-task keyword selection | 71.94% |
| D3a | Evolver creates from trajectories | 72.21% |
| D3b | Solver creates, evolver refines | 67.35% |
| D3c | Solver creates, evolver generalizes | 69.16% |
| D3d | Evolver creates and judges/refines | 70.37% |
| D4a | Generic skill creation guidance | 72.21% |
| D4b | + FORBIDDEN blocklist (negative prompting) | 73.17% |
| D4c | + positive steering | 70.36% |
| D4d | + negative prompting + positive steering | 73.21% |
| D5a–D5e | 5 runs from identical seed (skill variance) | 69.65%–74.77% |
Group E: Self-Verification (Section 6)
| Experiment | Description | Score | Note |
|---|---|---|---|
| E1a | Baseline — no verification (21 flaky tasks) | 33.33% | Flaky subset only |
| E1b | Two-step verification-submit workflow | 42.86% | Flaky subset only |
| E1c | Optional verify() tool | 47.62% | Flaky subset only |
| E1d | Self-verification skill | 66.67% | Flaky subset only |
| E1e | Enhanced system prompt | 52.38% | Flaky subset only |
| E1f | Skill + prompt combo | 42.86% | Flaky subset only |
| E2a | No predefined seed skills (full benchmark) | 72.21% | |
| E2b | Generic self-verification skill | 70.16% | |
| E2c | Curated self-verification skill | 74.27% | |
| E3a | No nudge, targeted skill | 74.27% | 24.71% skill load rate |
| E3b | Nudge + generic skill | 68.24% | 77.65% skill load rate |
| E3c | Nudge + targeted skill | 76.47% | 90.59% skill load rate |
Key Progression Summary
The path from baseline to best result:
| Stage | Key Innovation | Score | Δ from A2 |
|---|---|---|---|
| A1 | Opus 4.5 baseline | 60.15% | — |
| A2 | Opus 4.6 baseline | 65.91% | — |
| C4a | Skill-only evolution (budgeted) | 70.58% | +4.67pp |
| D2b | Skill count budget = 6 | 72.21% | +6.30pp |
| D4d | FORBIDDEN blocklist | 73.21% | +7.30pp |
| E2c | Curated verification skill | 74.27% | +8.36pp |
| E3c | Targeted verification + nudge | 76.47% | +10.56pp |
7.2 Leaderboard
| System | Score | Notes |
|---|---|---|
| Opus 4.6 (A2) | 65.91% | |
| Copy (Top #7) | 75.3% | |
| A-Evolve (E3c) | 76.47% | |
| ForgeCode (Top #1) | 81.8% | |
8. Key Learnings
What Works
Lazy skill loading. Let the model decide which skills to read. Without nudging, the model self-selects on ~25–33% of tasks (E3a, C4a); with a targeted nudge, up to 91% (E3c). System prompt stays at ~1,900 characters instead of 15,000+.
Targeted self-verification. A “Check Your Assumptions” skill that verifies decision choices (normalization method, algorithm selection, package state) rather than generic edge cases. Content quality drives the improvement — generic verification hurts (E2b, -2.05pp) while targeted verification helps (E3c, +10.56pp).
FORBIDDEN blocklist. Blocking generic process advice forces domain-specific skill creation (D4b, +0.96pp over unblocked D4a). Negative prompting outperforms positive steering (D4b vs D4c).
Minimal prompt. The 1,368-character base prompt is near-optimal (C1b shows longer prompts hurt). Total budget including nudges: ~1,485 characters. Every addition must be weighed against attention dilution.
What Doesn’t Work
Solver-proposed skills. The solver extracts task-specific knowledge but fails at generalization (D3b: -4.86pp vs D3a).
Combining verification approaches. Skill + prompt performs worse than either alone (E1f: +9.52pp vs E1d: +33.33pp on flaky tasks).
Meta-skills. A “systematic-exploration” skill was loaded on only 1.18% of tasks. The model does not self-assess when it is stuck.
Unbounded skill injection. Removing skill count budgets degrades performance (D2c: 68.23% vs D2b: 72.21%) even with lazy loading, because the agent loads irrelevant skills.
Design Insights
Error analysis drives improvement. The +9.41pp gain came from trajectory analysis → failure pattern identification → targeted skill content. Human-in-the-loop analysis outperformed all automated evolution.
Simple scaffold wins. Best configuration: minimal prompt (1.5K) + 5 focused skills (8.5K) + lazy loading + no evolution.
9. Future Directions
- Richer evolver signal. The trajectory-only evolver sees compressed logs without pass/fail. Richer feedback (judge score breakdowns, failure localization) could improve skill quality.
- Multi-agent evolver. The current single-agent evolver makes path-dependent decisions (D5) and must handle analysis, ideation, mutation, and quality control in one context window. A multi-agent evolver architecture could decompose these responsibilities: an orchestrator agent coordinates the evolution cycle; an evolver sub-agent mutates workspace files (the current role); an experiment analysis sub-agent performs deep trajectory and metric analysis to surface failure patterns; a direction exploration sub-agent proposes and prioritizes improvement hypotheses; and a code review sub-agent validates mutations for correctness and consistency before they are committed. This separation of concerns mirrors how human research teams operate and could reduce the variance and error accumulation observed in single-agent evolution.
