Designing Self-evolving Coding Agents for Terminal-Bench 2.0: A Systematic Exploration
Abstract. We present a systematic exploration of designing self-evolving coding agents for Terminal-Bench 2.0, a benchmark of 89 diverse command-line tasks. Starting from a vanilla ReAct agent powered by Claude Opus 4.6 (65.91%), we conduct large-scale controlled experiments across five groups: establishing baselines, analyzing the performance ceiling (pass@5: 85.88%), exploring four harness components (prompt, memory, tools, skills), refining skill evolution strategies, and developing self-verification mechanisms. Our investigation reveals that among all mutable components, skill evolution delivers the most consistent improvement, and a nudged self-verification mechanism boosts performance further. Altogether, our best self-evolution algorithm achieves 76.47%, a +10.56pp improvement over the Opus 4.6 vanilla baseline, ranking #7 on the Terminal-Bench 2.0 leaderboard.
1. Introduction
1.1 Terminal-Bench 2.0
Terminal-Bench 2.0 (TB2) is a benchmark of 89 diverse command-line tasks spanning system administration, scientific computing, security, data science, and software engineering (e.g., configuring web servers, analyzing Bayesian networks, training ML models, sanitizing git history, and compiling software from source). Each task provides a Docker container with a pre-configured environment, a natural language task description, and a hidden test script that validates the solution. Task difficulty ranges widely, from 5-minute file edits to 60-minute multi-step system configurations, with per-task timeouts.
1.2 Agent Architecture
Our solver agent (i.e., the coding agent) employs two agent frameworks: (1) Strands Agents + AgentCore; (2) a standard ReAct loop + Sandbox/Docker. The solver has access to three core tools:
- bash(): execute shell commands in the Docker container (60s timeout per call)
- python(): execute Python code in an isolated interpreter
- submit("DONE"): signal task completion
The system prompt follows an Analysis/Plan format that structures the agent’s reasoning before each tool call.
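The loop described above can be sketched as follows. This is a minimal illustrative reconstruction, not our actual implementation: `call_model`, the tuple-based history format, and the turn budget are assumptions.

```python
import subprocess

# Minimal sketch of the solver's ReAct loop. Tool names match the text;
# call_model, the history format, and max_turns are illustrative assumptions.
TOOLS = {
    "bash": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=60
    ).stdout,
    "python": lambda code: str(eval(code)),  # stand-in for an isolated interpreter
}

def react_loop(call_model, task, max_turns=100):
    """Run Analysis/Plan turns until the model calls submit."""
    history = [("task", task)]
    for _ in range(max_turns):
        tool, arg = call_model(history)      # model picks the next tool call
        if tool == "submit":
            return arg                        # e.g. "DONE"
        observation = TOOLS[tool](arg)
        history.append((tool, observation))   # observation feeds the next turn
    return None  # turn budget exhausted without submission
```

Each tool result is appended to the history, so the model reasons over the full trajectory before its next call.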
1.3 The Evolution Framework (A-Evolve)
A-Evolve deploys a trajectory-only, in-situ evolution loop as follows:
```text
Solver Agent (bash/python/submit tools)
  ↓ solves tasks in Docker containers
Trajectory Observer (JSONL logs)
  ↓ records tool calls, outputs, ...
LLM Judge (scores 0-10 without ground truth)
  ↓ evaluates trajectories / identifies potential failure patterns / ...
Evolver Agent (LLM agent with a bash tool to modify the agent harness)
  ↓ mutates prompts/skills/tools/memories based on observations
Skill Library (lazy-loading) / Prompt / Tool Registry / Memory
  ↓ agent reads skills on demand / loads updated prompts / ...
Next Batch's Solver
```
The key design constraint is that evolution operates without ground-truth feedback. The evolver never sees test results — it must infer what went wrong from the trajectory alone. This makes the problem significantly harder than test-guided optimization.
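One cycle of this loop can be sketched as follows. It is a hypothetical outline under the constraint above (the judge and evolver see only trajectories and judge scores, never test results); all function names are illustrative.

```python
# Sketch of one A-Evolve cycle. The evolver receives trajectories and judge
# scores only -- no pass/fail signal -- and returns a mutated harness.
def evolve_cycle(tasks, solve, observe, judge, evolve, workspace):
    trajectories = []
    for task in tasks:
        traj = observe(solve(task, workspace))  # JSONL-style log of tool calls/outputs
        traj["judge_score"] = judge(traj)       # 0-10, assigned without ground truth
        trajectories.append(traj)
    # Mutate skills/prompt/tools/memory for the next batch's solver.
    return evolve(trajectories, workspace)
```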
1.4 Experiment Metrics
For the solver agent, we measure: (1) pass rate (resolved/total, %); (2) average turns; (3) average tool calls; (4) average number of skills used; (5) running time. For the evolver agent, we measure: (1) number of created skills/tools/memories; (2) characters changed in the prompt/memory.
1.5 Experiment Notation
We conduct our experiments in a principled way to systematically build a self-evolving coding agent, organizing them into the following chronological, thematic groups. Each group is labeled with a letter prefix, and individual experiments within a group are numbered sequentially:
| Group | Focus | Experiments | Section |
|---|---|---|---|
| A | Baseline establishment | A1–A2 | Section 2 |
| B | Ceiling analysis and task classification | B1a–B1c, B2, B3 | Section 3 |
| C | Harness component exploration (prompt, memory, tools, skills, combinations) | C1a–C1b, C2a–C2b, C3a–C3c, C4a–C4b, C5a–C5g | Section 4 |
| D | Focused skill evolution strategies (size, count, generation, filtering, variance) | D1a–D1g, D2a–D2d, D3a–D3d, D4a–D4d, D5a–D5e | Section 5 |
| E | Self-verification test-time scaling | E1a–E1f, E2a–E2c, E3a–E3c | Section 6 |
1.6 Infrastructure: API reliability and context management
Our agent infrastructure addresses two key challenges: API reliability and context management. For API reliability, running 20 concurrent AWS Bedrock API calls requires robust error handling — we implement transient error retries (covering read timeouts, content filtering false positives, and connection hangs), timeout clock pausing during retry waits so that idle retry time does not consume the task’s budget, and a wall-clock safety cap to prevent infinite retry loops on persistent outages. For context management, we employ a summarization-based approach: when the conversation reaches the context window limit, we summarize the preceding conversation history into a condensed representation and continue from that summary, preserving critical state while freeing context capacity for continued reasoning.
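The retry behavior described above can be sketched as follows. This is a simplified illustration (the exception type, backoff schedule, and parameter names are assumptions): sleep time between retries is excluded from the task's time budget, while a wall-clock cap bounds total retrying.

```python
import time

class TransientError(Exception):
    """Stand-in for retryable API failures (read timeouts, connection hangs, ...)."""

def call_with_retries(fn, wait=1.0, backoff=2.0, max_retry_wall=600):
    """Return (result, active_seconds). Only time spent inside fn is charged
    to the task budget; sleep between retries is not."""
    active, retry_t0 = 0.0, time.monotonic()
    while True:
        start = time.monotonic()
        try:
            return fn(), active + (time.monotonic() - start)
        except TransientError:
            active += time.monotonic() - start      # failed call time still counts
            if time.monotonic() - retry_t0 > max_retry_wall:
                raise                               # safety cap on persistent outages
            time.sleep(wait)                        # paused clock: not charged
            wait = min(wait * backoff, 30.0)        # exponential backoff, capped
```

The caller subtracts only `active_seconds` from the task's remaining budget, so transient outages do not starve the agent of working time.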
2. Establishing the Baseline (Group A)
2.1 Opus 4.5 Baseline (A1)
Our initial experiments used Claude Opus 4.5. The vanilla ReAct coding agent scored an average of 60.15% across 5 runs. (For simplicity, all results are reported as the average of 5 runs by default; some use 3 runs due to API quota limits.)
| Metric | A1: Opus 4.5 |
|---|---|
| Score | AVG 60.15% |
| Mean turns (all tasks) | 72.84 |
| Mean input tokens/task | 1,081,840 |
| Mean output tokens/task | 15,689 |
| Prompt length | 1,368 chars |
| Skills loaded | 0 (no loading mechanism) |
2.2 Opus 4.6 Baseline (A2)
When Claude Opus 4.6 was released (February 5, 2026), we used the same ReAct workflow and observed a significant capability jump:
| Metric | A1: Opus 4.5 | A2: Opus 4.6 | Change |
|---|---|---|---|
| Score | AVG 60.15% | AVG 65.91% | +5.76pp |
| Mean turns (all) | 72.84 | 60.68 | -16.70% |
| Mean input tokens/task | 1,081,840 | 670,784 | -38.01% |
| Mean output tokens/task | 15,689 | 13,910 | -11.34% |
| Prompt length | 1,368 chars | 1,368 chars | Same |
| Skills loaded | 0 | 0 | Same |
Finding: Opus 4.6 solves tasks with 38.01% fewer input tokens and 16.70% fewer turns while achieving a higher pass rate; the model is substantially more efficient and capable. In particular, we observe that Opus 4.6 tends to self-verify its results automatically.
3. Ceiling Analysis and Task Classification (Group B)
Before attempting to improve performance, we needed to (1) estimate the upper bound on performance, by computing pass@5 from A2 (B1a), increasing the per-task time limit (B1b), and providing an environment signal in the ReAct loop that compares the generated artifact against the ground-truth result (B1c); and (2) understand the benchmark's structure: which tasks are deterministically solvable, which are beyond the model's capability, and which are stochastic. We then focus on the stochastic, flaky tasks for improvement.
3.1 Ceiling Analysis (B1)
| Experiment | Design details | Performance |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs (upper bound of flaky stabilization) | 85.88% |
| B1b | 2× per-task timeout (1,800s–7,200s) | 68.12% |
| B1c | Environment signal: compare generated artifact against ground-truth output in the ReAct loop | 66.23% |
Finding: B1a gives the north star guiding our Terminal-Bench 2.0 exploration (85.88%). B1c shows that environment signals do not help much on these coding tasks. B1b shows that doubling the running time helps by ~2pp (68.12% vs 65.91%), indicating TB2 is a resource-heavy benchmark.
3.2 Task Stability Classification (B2)
Across 5 independent Opus 4.6 runs, we classified all 89 tasks by their pass rate:
| Category | Count | Pass Rate | Description |
|---|---|---|---|
| Always pass | 52 | 100.00% | Deterministic — model capability sufficient |
| Flaky (unstable) | 11 | 40–80% (pass 2–4 of 5 runs) | Likely solvable with small improvements |
| Flaky (rare pass) | 10 | 20% (pass 1 of 5 runs) | Hard but occasionally solvable |
| Always fail | 16 | 0.00% | Beyond current capability |
Finding for the theoretical ceiling: If all flaky tasks could be stabilized, the score would reach 85.88%. The 16 always-fail tasks represent a hard capability boundary.
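This classification and the pass@5 ceiling can be reproduced mechanically from a per-task pass/fail matrix. The sketch below assumes a simple dict of five boolean results per task; the input format is our assumption and the values in the test are illustrative.

```python
# Classify tasks by pass count across runs and compute pass@k as the union
# of successes across the independent runs.
def classify(runs):
    buckets = {"always_pass": [], "flaky_unstable": [], "flaky_rare": [], "always_fail": []}
    for task, results in runs.items():
        k = sum(results)
        if k == len(results):
            buckets["always_pass"].append(task)     # deterministic successes
        elif k == 0:
            buckets["always_fail"].append(task)     # capability boundary
        elif k == 1:
            buckets["flaky_rare"].append(task)      # pass 1 of 5
        else:
            buckets["flaky_unstable"].append(task)  # pass 2-4 of 5
    pass_at_k = sum(any(r) for r in runs.values()) / len(runs)
    return buckets, pass_at_k
```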
3.3 Failure Pattern Taxonomy (B3)
Deep trajectory analysis, comparing successful and failed runs of the same task side by side, revealed four distinct failure patterns:
| Pattern | Frequency | Description | Challenge |
|---|---|---|---|
| Premature convergence | High | Model locks into a wrong approach early and never backtracks | Moderate |
| Semantic reasoning error | Medium | Model misreads task requirements | Moderate |
| Hidden eval constraints | Medium | Task has implicit requirements not in the description | Low |
| Algorithm stochasticity | High | Equally plausible approaches differ in robustness | High |
These failure patterns help guide our next-stage experiment design for harness improvement (i.e., workflow, prompt, memory, tool, and skill).
4. Exploring Improvement Directions (Group C)
With the baseline, north star, and flaky tasks identified, we systematically explored four workspace components: tools, memory, system prompt, and skills.
4.1 System Prompt: Behavioral Strategy Evolution (C1)
We restricted the evolver to modifying only the system prompt to address the failure patterns from B3, e.g., adding behavioral strategy rules such as "Comprehensively understand the task and consider various directions before starting to solve the problem."
| Experiment | Approach | Score | Prompt Length |
|---|---|---|---|
| C1a | Prompt-only evolution with lightweight guidance | 66.02% | 1,368 -> 1,940 chars |
| C1b | Prompt-only evolution with tailored guidance from B3 | 60.13% | 1,368 -> 4,940 chars |
Finding: C1a scored essentially the same as vanilla A2 (+0.11pp), while the longer, more complex prompt in C1b hurt substantially. The strategy rules are qualitatively sensible but quantitatively inert: Opus 4.6 already implements these strategies implicitly.
4.2 Memory: Episodic Learning Across Tasks (C2)
We tested memory injection: episodic observations from previous tasks, fully loaded into the context of future tasks. Memories accumulated across evolution cycles without providing task-relevant information.
| Experiment | Approach | Score | Memory Length |
|---|---|---|---|
| C2a | Memory-only evolution with memory budgets | 65.11% | 1,431 chars |
| C2b | Memory-only evolution without limitations | 60.46% | 2,770 chars |
Finding: Memory injection was consistently neutral or slightly negative. The evolver stored generic process observations (“always submit your answer,” “save intermediate results”) — advice the model already knows.
4.3 Tools: Universal vs Task-Specific (C3)
We tested the impact of pre-defined tools and self-generated tools. The pre-defined tool set is Bash, Python, and Submit; in the self-evolving stage, the evolver agent generates task-specific tools from task trajectories.
| Experiment | Approach | Score | Notes |
|---|---|---|---|
| C3a | Bash + Python | 58.12% | |
| C3b | Bash + Python + Submit | 65.91% | the same setting as A2 |
| C3c | Bash + Python + Submit + Self-generated tools (e.g., SecurityFileCheck(), Verify()) | 59.37% |
Finding: (1) The Submit tool helps because the agent otherwise does not know when to stop; in C3a trajectories, the agent sometimes continues working on the problem without ever submitting results. (2) Self-generated task-specific tools hurt on terminal/coding tasks: similar tasks were heavily trained in post-training stages using only universal tools (e.g., grep, bash). (3) Verification logic needs to be richer than what a tool interface naturally supports; the agent often called verify() but performed only superficial checks.
4.4 Skills: The Most Promising Direction (C4)
We examine the influence of self-generated skills during evolution. Skills serve as domain-specific knowledge and provide guidance when the agent solves similar tasks; they are loaded lazily.
| Experiment | Approach | Score | Skill length |
|---|---|---|---|
| C4a | Skill-only evolution with skill count budgets | 70.58% | 3,012 characters |
| C4b | Skill-only evolution without budgets | 68.23% | 8,122 characters |
Finding: (1) In C4a, after the evolver generated 3 seed skills, the agent chose to read at least one skill on 32.94% of the remaining tasks. (2) Even without a budget (C4b), skills still deliver a gain over the A2 baseline, though smaller than C4a. Unlike tool, memory, or prompt changes, well-targeted skills produce a measurable improvement.
4.5 Auto-Harness: The combination of prompt, tool, memory, and skill (C5)
Finally, we test combinations of harness components. From C1–C4 we know that skills help most and that lazy loading does not hurt performance much, so we explore different harness combinations on top of skill evolution.
| Experiment | Approach | Score | Note |
|---|---|---|---|
| C5a | Skill + Prompt | 67.11% | |
| C5b | Skill + Tool | 66.23% | |
| C5c | Skill + Memory | 67.02% | |
| C5d | Skill + Prompt + Tool | 69.13% | |
| C5e | Skill + Prompt + Memory | 67.29% | |
| C5f | Skill + Tool + Memory | 64.19% | |
| C5g | Skill + Prompt + Tool + Memory | 65.19% | |
Finding: Across the C5 series, adding prompt, tool, or memory evolution on top of skills hurts in every combination except C5d (Skill + Prompt + Tool). This indicates that, for Terminal-Bench 2.0, we should focus on skill evolution to approach the north star established in B1.
5. Skill Evolution Strategies (Group D)
Skills use a lazy-loading strategy: we list only skill names and one-line descriptions in the system prompt, and the agent calls read_skill(name) to load full content on demand.
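A minimal sketch of this mechanism follows; the skill registry and its contents are hypothetical examples, not our actual skill library.

```python
# Hypothetical skill registry: one-line summaries go in the system prompt,
# full bodies are fetched only via read_skill.
SKILLS = {
    "scientific-computing": {
        "summary": "Numerical pitfalls: normalization choices, KL divergence",
        "body": "(full 1,500-2,000 character skill content would go here)",
    },
}

def skill_index():
    """One line per skill for the system prompt: name + one-line description."""
    return "\n".join(f"- {name}: {s['summary']}" for name, s in sorted(SKILLS.items()))

def read_skill(name):
    """Tool the agent calls to load a skill's full content on demand."""
    skill = SKILLS.get(name)
    return skill["body"] if skill else f"Unknown skill: {name}"
```

Only the index is always in context; full skill bodies cost tokens only when the model decides they are relevant.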
5.1 Skill Size Optimization (D1)
We examine the impact of skill size:
| Experiment | Approach | Score | Note |
|---|---|---|---|
| D1a | 0–500 chars | 67.21% | |
| D1b | 500–1,000 chars | 66.73% | |
| D1c | 1,000–1,500 chars | 68.91% | |
| D1d | 1,500–2,000 chars | 71.23% | |
| D1e | 2,000–2,500 chars | 70.02% | |
| D1f | 2,500–3,000 chars | 68.43% | |
| D1g | 3,000–3,500 chars | 67.19% | |
Finding: Skill size exhibits a clear sweet spot at 1,500–2,000 characters (D1d, 71.23%). Skills shorter than 1,000 characters lack sufficient domain detail to be useful, while skills exceeding 2,500 characters dilute the signal with redundant content. The drop-off is roughly symmetric: -4.02pp below the sweet spot (D1a) and -4.04pp above it (D1g). This guides our skill creation constraints in subsequent experiments.
5.2 Skill count budget and Per-Task Keyword Selection (D2)
We investigate the number of skills added. We also test a mitigation for the unbounded setting: match skill keywords against the task description and inject only the top-k skills ranked by n-gram overlap relevance.
| Experiment | Skill Delivery | Score |
|---|---|---|
| D2a | Skill count budget of 3 | 70.58% |
| D2b | Skill count budget of 6 | 72.21% |
| D2c | No skill count budget | 68.23% |
| D2d | No budget + per-task keyword selection | 71.94% |
Finding: A moderate skill budget of 6 (D2b) achieves the best score at 72.21%, outperforming both the tighter budget of 3 (D2a, 70.58%, with ratio of tasks using skills as 32.94%) and the unbounded setting (D2c, 68.23%). Unbounded skill injection degrades performance because the agent loads irrelevant skills that distract from the task. Per-task keyword selection (D2d, 71.94%) partially mitigates the unbounded problem by filtering skills for relevance, recovering most of the gain without a hard count limit.
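The per-task keyword selection in D2d might look like the following sketch. The report specifies only "top-k by n-gram overlap," so the bigram scoring and tie-breaking below are assumptions.

```python
# Rank skills by n-gram overlap with the task description and keep the top-k
# (bigrams here; the exact n and scoring are assumptions).
def ngrams(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def select_skills(task_description, skills, k=6, n=2):
    """skills maps name -> description; return top-k names with nonzero overlap."""
    task_grams = ngrams(task_description, n)
    scored = [(len(task_grams & ngrams(desc, n)), name) for name, desc in skills.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))   # highest overlap first, name as tie-break
    return [name for score, name in scored[:k] if score > 0]
```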
5.3 Solver-Proposed Skills (D3)
We tested having the solver (which has full environment context) propose skill drafts after each task, with the evolver judging and curating the proposals.
| Experiment | Skill Generation | Score | Note |
|---|---|---|---|
| D3a | Evolver creates from trajectories | 72.21% | |
| D3b | Solver creates, evolver refines | 67.35% | |
| D3c | Solver creates, evolver generalizes | 69.16% | |
| D3d | Evolver creates, then judges and refines | 70.37% | |
Finding: “Solver creates, evolver refines” (D3b) has the lowest score. The solver excels at extracting task-specific knowledge but fails at generalization. The evolver merged hyper-specific drafts into existing skills, replacing useful content with narrow tips like “ICS calendar parsing” and “OCR configuration.” “Solver creates, evolver generalizes” (D3c) partially recovered by instructing the evolver to “never shrink” existing skills and to generalize across them. However, the best result still comes from the evolver creating skills directly from trajectories (D3a). Lesson: The bottleneck is not knowledge extraction but knowledge generalization — the solver cannot reliably identify what transfers to unseen tasks.
5.4 Case study: The Process-Advice Trap (D4)
When we analyzed long-horizon trajectories and read the skill content from the D1–D3 experiments, we found that the evolver gravitates toward generic process advice. Without constraints, it produces skills like "always check if packages are installed" and "handle timeouts by trying alternatives." We introduced a FORBIDDEN/REQUIRED blocklist as follows:
```text
FORBIDDEN: Timeout handling, pip tips, session persistence, generic debugging
REQUIRED: Domain-specific libraries, verification steps, domain pitfalls ONLY
```
| Experiment | Skill guidance | Score | Note |
|---|---|---|---|
| D4a | Generic skill creation guidance in evolver agent | 72.21% | |
| D4b | + negative prompting | 73.17% | |
| D4c | + positive steering | 70.36% | |
| D4d | + negative prompting + positive steering | 73.21% |
Finding: Negative prompting (D4b, blocking bad outputs) was more effective than positive steering (D4c). The combination (D4d) achieves the highest score but only marginally above negative prompting alone (+0.04pp), suggesting most of the gain comes from the blocklist. Without the blocklist, skills accumulate generic process advice; with it, they stay focused on domain-specific knowledge.
5.5 Case Study: Skill Evolution Non-Determinism (D5)
Similar to the non-determinism in LLM response generation and agent execution, we examine non-determinism in skill evolution: we ran five evolution runs from an identical seed. The results are shown below.
| Experiment | Example debugging-skill content | Scientific-computing skill content | Score |
|---|---|---|---|
| D5a | Protobuf API migration | ML GPU detection | 69.65% |
| D5b | ICS file writing | OCR/tesseract | 71.30% |
| D5c | constraint solvers | KL divergence | 72.21% |
| D5d | C/C++ memory debugging (valgrind) | calendar scheduling | 73.12% |
| D5e | async cancellation | logic circuits | 74.77% |
Finding: Five evolution runs from identical seeds produced substantially different skill content — each evolver wrote a different “textbook” based on which tasks happened to fail in each batch. Despite the content divergence, scores ranged from 69.65% to 74.77%, indicating that multiple distinct skill sets can be effective. This implies that evolved skills are samples from a large space of possibly-useful knowledge, heavily path-dependent on batch ordering and stochastic task outcomes. Ensembling diverse skill sets may be a promising future direction.
6. Self-Verification: Breaking the Ceiling (Group E)
For motivation, our error analysis (Section 3.3) revealed that many flaky task failures stem from unchecked assumptions rather than missing knowledge. The agent solves the task correctly most of the time but occasionally submits without verifying a critical decision — wrong normalization, broken package manager, overfitting to the example.
6.1 Ablation Study: Various Verification Approaches (E1)
We tested five verification delivery mechanisms, plus a no-verification baseline, on the 21 flaky tasks (Opus 4.6, no evolution):
| Experiment | Approach | Score | Change | Avg Turns | Total Tokens |
|---|---|---|---|---|---|
| E1a | Baseline (no verification) | 33.33% | — | 77.8 | 24.34M |
| E1b | Two-step verification-submit workflow | 42.86% | +9.52pp | 78.1 | 25.34M |
| E1c | Optional verify() tool | 47.62% | +14.29pp | 91.0 | 46.46M |
| E1d | Self-verification skill | 66.67% | +33.33pp | 82.0 | 32.29M |
| E1e | Enhanced system prompt | 52.38% | +19.05pp | 69.5 | 18.87M |
| E1f | Skill + prompt combo | 42.86% | +9.52pp | 71.2 | 22.05M |
The self-verification skill (E1d) was the clear winner. Critically, the combination of skill + prompt (E1f) performed worse than either alone — a counterintuitive result with an important explanation.
Why the Combination Hurts
The skill works because it’s lazy-loaded: the model reads it only when uncertain. The prompt instruction forces verification on every task, including those where the model is already confident. When both are present:
- The prompt forces a verification step the model wouldn’t naturally take
- The model also loads the skill (because it’s now in verification mode)
- Redundant double-verification wastes 2-3 turns per task
- The aggregate overhead costs more tasks than extra checking saves
Lesson: Let the model self-select when to verify rather than forcing it universally.
6.2 Scaling to Full Benchmark: The Initial Disappointment (E2)
The ablation showed the gain from the self-verification skill on the 21 flaky tasks. We then ran the same configuration on all 89 tasks with various self-verification seed skills.
| Experiment | Skill setup | Score | Note |
|---|---|---|---|
| E2a | No predefined seed skills | 72.21% | |
| E2b | Generic self-verification skill as seed skill | 70.16% | |
| E2c | Curated self-verification skill as seed skill | 74.27% |
Finding: Without any predefined skills (E2a), the baseline already achieves 72.21%. Adding a generic self-verification skill (E2b) actually hurts by 2.05pp — the skill contained advice like “check edge cases,” “test empty input,” “verify permissions,” which is knowledge the model already has (the same process-advice trap from Section 5.4). Only when we curated the skill with targeted assumption-checking content (E2c) did verification deliver a +2.06pp gain. This confirms that what the skill says matters far more than that a skill exists.
Guided by failure pattern analysis (Section 3.3), we replaced the generic advice with targeted assumption-checking:
Before (generic):
```text
## 3. Edge Cases
- Check with empty input, special characters, large input
- Verify error handling works
- Check permissions (chmod, ownership)
```
After (targeted):
```text
## 3. Check Your Assumptions
- If you chose between multiple approaches (e.g., normalization methods),
  verify your choice matches what the task/test expects
- If you installed or modified system packages, verify the package manager
  still works
- If your solution works on the provided example, consider whether it
  generalizes to different inputs
```
Each bullet addresses a real failure mode:
- Normalization choice: pytorch-model-cli failure (chose MNIST mean/std instead of raw /255)
- Package manager state: dna-assembly failure (apt-get timeout corrupted dpkg)
- Generalization: video-processing failure (MOG2 overfit to example video)
We also enhanced the server verification: “curl or wget to test endpoints — verify response content, not just that the port is open. Checking ports with ss or netstat is NOT enough.” This addresses install-windows-3.11 where the agent verified port 80 was open but never tested nginx content.
6.3 The Nudge: Directing the Model to Read the Skill (E3)
With improved content, we added a single line to the system prompt: “After you think you have completed the task, read the self-verification skill to verify your solution.”
| Experiment | Nudge | Skill Content | Skill Loads | Score |
|---|---|---|---|---|
| E3a | None | Targeted | 24.71% | 74.27% |
| E3b | Verification nudge | Generic | 77.65% | 68.24% |
| E3c | Verification nudge | Targeted | 90.59% | 76.47% |
Finding: The system prompt nudge dramatically increases skill loading frequency — from 24.71% (E3a) to 77.65%–90.59% (E3b/E3c). However, loading frequency alone does not guarantee improvement: E3b loads the skill on 77.65% of tasks yet drops performance by 6.03pp because the generic content wastes turns on checks the model already performs. E3c, with targeted content and 90.59% loading, achieves the best result at 76.47%. Lesson: Content quality matters more than loading frequency. The improvement comes from making each verification round productive (catches real bugs) rather than wasteful (checks things the model already knows).
7. Results Summary
7.1 Full Score Progression
Below we present all experiments organized by group. Detailed analysis for each experiment is in its respective section.
Group A: Baselines (Section 2)
| Experiment | Description | Score |
|---|---|---|
| A1 | Opus 4.5 vanilla ReAct agent | 60.15% |
| A2 | Opus 4.6 vanilla ReAct agent | 65.91% |
Group B: Ceiling Analysis (Section 3)
| Experiment | Description | Score |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs | 85.88% |
| B1b | 2× per-task timeout | 68.12% |
| B1c | Environment signal (artifact vs ground-truth) | 66.23% |
Group C: Harness Component Exploration (Section 4)
| Experiment | Description | Score |
|---|---|---|
| C1a | Prompt-only evolution (lightweight) | 66.02% |
| C1b | Prompt-only evolution (tailored from B3) | 60.13% |
| C2a | Memory-only evolution (with budgets) | 65.11% |
| C2b | Memory-only evolution (unbounded) | 60.46% |
| C3a | Bash + Python (no Submit tool) | 58.12% |
| C3b | Bash + Python + Submit (= A2) | 65.91% |
| C3c | + Self-generated tools | 59.37% |
| C4a | Skill-only evolution (budgeted) | 70.58% |
| C4b | Skill-only evolution (unbounded) | 68.23% |
| C5a | Skill + Prompt | 67.11% |
| C5b | Skill + Tool | 66.23% |
| C5c | Skill + Memory | 67.02% |
| C5d | Skill + Prompt + Tool | 69.13% |
| C5e | Skill + Prompt + Memory | 67.29% |
| C5f | Skill + Tool + Memory | 64.19% |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |
Group D: Skill Evolution Strategies (Section 5)
| Experiment | Description | Score |
|---|---|---|
| D1a | Skill size 0–500 chars | 67.21% |
| D1b | Skill size 500–1,000 chars | 66.73% |
| D1c | Skill size 1,000–1,500 chars | 68.91% |
| D1d | Skill size 1,500–2,000 chars | 71.23% |
| D1e | Skill size 2,000–2,500 chars | 70.02% |
| D1f | Skill size 2,500–3,000 chars | 68.43% |
| D1g | Skill size 3,000–3,500 chars | 67.19% |
| D2a | Skill count budget = 3 | 70.58% |
| D2b | Skill count budget = 6 | 72.21% |
| D2c | No skill count budget | 68.23% |
| D2d | No budget + per-task keyword selection | 71.94% |
| D3a | Evolver creates from trajectories | 72.21% |
| D3b | Solver creates, evolver refines | 67.35% |
| D3c | Solver creates, evolver generalizes | 69.16% |
| D3d | Evolver creates and judges/refines | 70.37% |
| D4a | Generic skill creation guidance | 72.21% |
| D4b | + FORBIDDEN blocklist (negative prompting) | 73.17% |
| D4c | + positive steering | 70.36% |
| D4d | + negative prompting + positive steering | 73.21% |
| D5a–D5e | 5 runs from identical seed (skill variance) | 69.65%–74.77% |
Group E: Self-Verification (Section 6)
| Experiment | Description | Score | Note |
|---|---|---|---|
| E1a | Baseline — no verification (21 flaky tasks) | 33.33% | Flaky subset only |
| E1b | Two-step verification-submit workflow | 42.86% | Flaky subset only |
| E1c | Optional verify() tool | 47.62% | Flaky subset only |
| E1d | Self-verification skill | 66.67% | Flaky subset only |
| E1e | Enhanced system prompt | 52.38% | Flaky subset only |
| E1f | Skill + prompt combo | 42.86% | Flaky subset only |
| E2a | No predefined seed skills (full benchmark) | 72.21% | |
| E2b | Generic self-verification skill | 70.16% | |
| E2c | Curated self-verification skill | 74.27% | |
| E3a | No nudge, targeted skill | 74.27% | 24.71% skill load rate |
| E3b | Nudge + generic skill | 68.24% | 77.65% skill load rate |
| E3c | Nudge + targeted skill | 76.47% | 90.59% skill load rate |
Key Progression Summary
The path from baseline to best result:
| Stage | Key Innovation | Score | Δ from A2 |
|---|---|---|---|
| A1 | Opus 4.5 baseline | 60.15% | — |
| A2 | Opus 4.6 baseline | 65.91% | — |
| C4a | Skill-only evolution (budgeted) | 70.58% | +4.67pp |
| D2b | Skill count budget = 6 | 72.21% | +6.30pp |
| D4d | FORBIDDEN blocklist | 73.21% | +7.30pp |
| E2c | Curated verification skill | 74.27% | +8.36pp |
| E3c | Targeted verification + nudge | 76.47% | +10.56pp |
7.2 Leaderboard
| System | Score | Notes |
|---|---|---|
| Opus 4.6 (A2) | 65.91% | |
| Copy (Top #7) | 75.3% | |
| A-Evolve (E3c) | 76.47% | |
| ForgeCode (Top #1) | 81.8% | |
8. Key Learnings
What Works
Lazy skill loading. Let the model decide which skills to read. Without nudging, the model self-selects on ~25–33% of tasks (E3a, C4a); with a targeted nudge, up to 91% (E3c). System prompt stays at ~1,900 characters instead of 15,000+.
Targeted self-verification. A “Check Your Assumptions” skill that verifies decision choices (normalization method, algorithm selection, package state) rather than generic edge cases. Content quality drives the improvement — generic verification hurts (E2b, -2.05pp) while targeted verification helps (E3c, +10.56pp).
FORBIDDEN blocklist. Blocking generic process advice forces domain-specific skill creation (D4b, +0.96pp over unblocked D4a). Negative prompting outperforms positive steering (D4b vs D4c).
Minimal prompt. The 1,368-character base prompt is near-optimal (C1b shows longer prompts hurt). Total budget including nudges: ~1,485 characters. Every addition must be weighed against attention dilution.
What Doesn’t Work
Solver-proposed skills. The solver extracts task-specific knowledge but fails at generalization (D3b: -4.86pp vs D3a).
Combining verification approaches. Skill + prompt performs worse than either alone (E1f: +9.52pp vs E1d: +33.33pp on flaky tasks).
Meta-skills. A “systematic-exploration” skill was loaded on only 1.18% of tasks. The model does not self-assess when it is stuck.
Unbounded skill injection. Removing skill count budgets degrades performance (D2c: 68.23% vs D2b: 72.21%) even with lazy loading, because the agent loads irrelevant skills.
Design Insights
Error analysis drives improvement. The +9.41pp gain came from trajectory analysis → failure pattern identification → targeted skill content. Human-in-the-loop analysis outperformed all automated evolution.
Simple scaffold wins. Best configuration: minimal prompt (1.5K) + 5 focused skills (8.5K) + lazy loading + no evolution.
9. Future Directions
- Richer evolver signal. The trajectory-only evolver sees compressed logs without pass/fail. Richer feedback (judge score breakdowns, failure localization) could improve skill quality.
- Multi-agent evolver. The current single-agent evolver makes path-dependent decisions (D5) and must handle analysis, ideation, mutation, and quality control in one context window. A multi-agent evolver architecture could decompose these responsibilities: an orchestrator agent coordinates the evolution cycle; an evolver sub-agent mutates workspace files (the current role); an experiment analysis sub-agent performs deep trajectory and metric analysis to surface failure patterns; a direction exploration sub-agent proposes and prioritizes improvement hypotheses; and a code review sub-agent validates mutations for correctness and consistency before they are committed. This separation of concerns mirrors how human research teams operate and could reduce the variance and error accumulation observed in single-agent evolution.
