Designing Self-evolving Coding Agents for Terminal-Bench 2.0: A Systematic Exploration

Abstract. We present a systematic exploration of designing self-evolving coding agents for Terminal-Bench 2.0, a benchmark of 89 diverse command-line tasks. Starting from a vanilla ReAct agent powered by Claude Opus 4.6 (65.91%), we conduct large-scale controlled experiments across five groups — establishing baselines, analyzing the performance ceiling (pass@5: 85.88%), exploring four harness components (prompt, memory, tools, skills), refining skill evolution strategies, and developing self-verification mechanisms. Our investigation reveals that among all mutable components, skill evolution delivers the most consistent improvement, and a nudged self-verification mechanism further boosts performance. Altogether, our best self-evolution algorithm achieves 76.47%, a +10.56pp improvement over the vanilla Opus 4.6 baseline, ranking #7 on the Terminal-Bench 2.0 leaderboard.


1. Introduction

1.1 Terminal-Bench 2.0

Terminal-Bench 2.0 (TB2) is a benchmark of 89 diverse command-line tasks spanning system administration, scientific computing, security, data science, and software engineering (e.g., configuring web servers, analyzing Bayesian networks, training ML models, sanitizing git history, and compiling software from source). Each task provides a Docker container with a pre-configured environment, a natural language task description, and a hidden test script that validates the solution. Task difficulty ranges widely, from 5-minute file edits to 60-minute multi-step system configurations, with per-task timeouts.

1.2 Agent Architecture

Our solver agent (i.e., the coding agent) can run on two agent frameworks: (1) Strands Agents + AgentCore; (2) a standard ReAct loop + Sandbox/Docker. The solver has access to three core tools: bash, python, and submit.

The system prompt follows an Analysis/Plan format that structures the agent’s reasoning before each tool call.
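The solver's control flow can be sketched as a minimal ReAct loop. This is an illustration only, not the production Strands/AgentCore harness: the LLM call is stubbed, and the tool wiring is simplified to the bash/python/submit set named in the report.

```python
# Minimal ReAct-style loop sketch: the model alternates Analysis/Plan
# reasoning with tool calls until it invokes `submit`.
# `llm` is a stand-in for the model; it returns (analysis, tool_name, args).

def run_react(task: str, llm, tools: dict, max_turns: int = 50) -> str:
    history = [("task", task)]
    for _ in range(max_turns):
        analysis, name, args = llm(history)   # Analysis/Plan step, then a tool call
        if name == "submit":
            return args                        # final answer ends the episode
        observation = tools[name](args)        # e.g., run bash in the container
        history.append((name, observation))    # observation feeds the next turn
    return ""                                  # ran out of turns without submitting
```

A usage sketch would stub `llm` with a scripted policy and `tools["bash"]` with a subprocess call; the loop terminates either on `submit` or on the turn budget.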

1.3 The Evolution Framework (A-Evolve)

A-Evolve runs a trajectory-only, in-situ evolution loop as follows:

Solver Agent (bash/python/submit tools)
    ↓ solves tasks in Docker containers
Trajectory Observer (JSONL logs)
    ↓ records tool calls, outputs, ...
LLM Judge (scores 0-10 without ground-truth)
    ↓ evaluates trajectories/identify potential failure patterns/...
Evolver Agent (LLM agent accessing bash tool to modify the agent harness)
    ↓ mutates prompts/skills/tools/memories based on observations
Skill Library (lazy-loading)/Prompt/Tool Registry/Memory
    ↓ agent reads skills on demand/loads updated prompts/...
Next Batch's Solver

The key design constraint is that evolution operates without ground-truth feedback. The evolver never sees test results — it must infer what went wrong from the trajectory alone. This makes the problem significantly harder than test-guided optimization.
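One batch of the loop above can be sketched as follows. The `solver`, `judge`, and `evolver` callables are stand-ins for the LLM-backed agents, and the workspace dict is an assumed simplification of the skill/prompt/tool/memory registry; the key property preserved is that the evolver sees only trajectories and judge scores, never test results.

```python
# Sketch of one A-Evolve batch cycle, mirroring the diagram above.
import json

def evolution_cycle(tasks, solver, judge, evolver, workspace: dict) -> dict:
    trajectories = []
    for task in tasks:
        traj = solver(task, workspace)       # list of tool-call records
        score = judge(traj)                  # LLM judge, 0-10, no ground truth
        trajectories.append({"task": task, "trajectory": traj, "score": score})
    # Trajectory observer: persist JSONL logs for the evolver to read.
    log = "\n".join(json.dumps(t) for t in trajectories)
    # Evolver mutates prompts/skills/tools/memories; next batch uses the result.
    return evolver(log, workspace)
```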

1.4 Experiment Metrics

For the solver agent, we measure: (1) pass rate, i.e., resolved/total (%); (2) average turns; (3) average tool calls; (4) average number of skills used; (5) running time. For the evolver agent, we measure: (1) number of created skills/tools/memories; (2) characters changed in the prompt/memory.

1.5 Experiment Notation

We conduct our experiments in a principled way to systematically build a self-evolving coding agent. In particular, we design the following chronologically ordered thematic groups. Each group is labeled with a letter prefix, and individual experiments within a group are numbered sequentially:

| Group | Focus | Experiments | Section |
|---|---|---|---|
| A | Baseline establishment | A1–A2 | Section 2 |
| B | Ceiling analysis and task classification | B1a–B1c, B2, B3 | Section 3 |
| C | Harness component exploration (prompt, memory, tools, skills, combinations) | C1a–C1b, C2a–C2b, C3a–C3c, C4a–C4b, C5a–C5g | Section 4 |
| D | Focused skill evolution strategies (size, count, generation, filtering, variance) | D1a–D1g, D2a–D2d, D3a–D3d, D4a–D4d, D5a–D5e | Section 5 |
| E | Self-verification test-time scaling | E1a–E1f, E2a–E2c, E3a–E3c | Section 6 |

1.6 Infrastructure: API reliability and context management

Our agent infrastructure addresses two key challenges: API reliability and context management. For API reliability, running 20 concurrent AWS Bedrock API calls requires robust error handling — we implement transient error retries (covering read timeouts, content filtering false positives, and connection hangs), timeout clock pausing during retry waits so that idle retry time does not consume the task’s budget, and a wall-clock safety cap to prevent infinite retry loops on persistent outages. For context management, we employ a summarization-based approach: when the conversation reaches the context window limit, we summarize the preceding conversation history into a condensed representation and continue from that summary, preserving critical state while freeing context capacity for continued reasoning.
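The retry scheme with "timeout clock pausing" can be sketched as below. This is a simplified illustration: the exception types, backoff schedule, and injectable `clock`/`sleep` parameters are assumptions for testability, not the production values.

```python
# Sketch: transient-error retry where time spent waiting on retries is
# refunded to the task budget, while a wall-clock cap bounds total runtime.
import time

TRANSIENT = (TimeoutError, ConnectionError)  # illustrative transient errors

def call_with_retries(fn, budget_s, wall_cap_s=3600.0, max_tries=5,
                      clock=time.monotonic, sleep=time.sleep):
    start = clock()
    refunded = 0.0                  # retry-wait time excluded from the budget
    delay = 1.0
    for _ in range(max_tries):
        if clock() - start > wall_cap_s:
            raise TimeoutError("wall-clock safety cap hit")
        if clock() - start - refunded > budget_s:
            raise TimeoutError("task budget exhausted")
        try:
            return fn()
        except TRANSIENT:
            t0 = clock()
            sleep(delay)            # idle retry time...
            delay *= 2              # ...with exponential backoff...
            refunded += clock() - t0  # ...does not consume the task budget
    raise RuntimeError("persistent failure after retries")
```

The design point is the `refunded` accumulator: without it, a burst of API retries would silently eat the task's solving time; the wall-clock cap still prevents infinite retry loops during persistent outages.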


2. Establishing the Baseline (Group A)

2.1 Opus 4.5 Baseline (A1)

Our initial experiments used Claude Opus 4.5. The vanilla ReAct coding agent scored an average of 60.15% across 5 runs (by default, all results are reported as averages over 5 runs; a few use 3 runs due to API quota limits).

| Metric | A1: Opus 4.5 |
|---|---|
| Score | AVG 60.15% |
| Mean turns (all tasks) | 72.84 |
| Mean input tokens/task | 1,081,840 |
| Mean output tokens/task | 15,689 |
| Prompt length | 1,368 chars |
| Skills loaded | 0 (no loading mechanism) |

2.2 Opus 4.6 Baseline (A2)

When Claude Opus 4.6 was released (February 5, 2026), we used the same ReAct workflow and observed a significant capability jump:

| Metric | A1: Opus 4.5 | A2: Opus 4.6 | Change |
|---|---|---|---|
| Score | AVG 60.15% | AVG 65.91% | +5.76pp |
| Mean turns (all) | 72.84 | 60.68 | -16.70% |
| Mean input tokens/task | 1,081,840 | 670,784 | -38.01% |
| Mean output tokens/task | 15,689 | 13,910 | -11.34% |
| Prompt length | 1,368 chars | 1,368 chars | Same |
| Skills loaded | 0 | 0 | Same |

Finding: Opus 4.6 solves tasks with 38.01% fewer input tokens and 16.70% fewer turns, while achieving a higher pass rate. The model is substantially more efficient and capable. In particular, we find that Opus 4.6 automatically self-verifies its results.

3. Ceiling Analysis and Task Classification (Group B)

Before attempting to improve performance, we needed to (1) estimate the performance upper bound, by computing pass@5 over the A2 runs (B1a), increasing the per-task running-time limit (B1b), and providing an environment signal in the ReAct loop that compares the generated artifact against ground-truth results (B1c); and (2) understand the benchmark's structure: which tasks are deterministically solvable, which are beyond the model's capability, and which are stochastic. We then focus on these stochastic, flaky tasks for improvement.

3.1 Ceiling Analysis (B1)

| Experiment | Design details | Performance |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs (upper bound of flaky stabilization) | 85.88% |
| B1b | 2× per-task timeout (1,800s–7,200s) | 68.12% |
| B1c | Environment signal: compare generated artifact against ground-truth output in the ReAct loop | 66.23% |

Finding: From B1a, we obtained the north star for our Terminal-Bench 2.0 exploration (85.88%). Meanwhile, B1c shows that providing environment signals does not help much on these coding tasks. B1b shows that doubling the running-time limit helps by +2.21pp, indicating TB2 is a resource-heavy benchmark for evaluation.
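The B1a ceiling is a simple pass@k aggregation over independent runs: a task counts as passed if any of the k runs solves it. The data structures below (per-run sets of solved task ids) are an assumed representation, not the actual harness output format.

```python
# Sketch of the pass@k ceiling estimate used in B1a (here k = len(runs)).
# `runs` is a list of sets of solved task ids, one set per independent run.

def pass_at_k(runs: list, task_ids: list) -> float:
    solved = set().union(*runs)            # solved by at least one run
    return 100.0 * sum(t in solved for t in task_ids) / len(task_ids)
```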

3.2 Task Stability Classification (B2)

Across 5 independent Opus 4.6 runs, we classified all 89 tasks by their pass rate:

| Category | Count | Pass Rate | Description |
|---|---|---|---|
| Always pass | 52 | 100.00% | Deterministic — model capability sufficient |
| Flaky (40–80%, unstable) | 11 | Pass 2–4/5 times | Likely solvable with small improvements |
| Flaky (20%, rare pass) | 10 | Pass 1/5 times | Hard but occasionally solvable |
| Always fail | 16 | 0.00% | Beyond current capability |

Finding for the theoretical ceiling: If all flaky tasks could be stabilized, the score would reach 85.88%. The 16 always-fail tasks represent a hard capability boundary.

3.3 Failure Pattern Taxonomy (B3)

Deep trajectory analysis — comparing successful and failed runs of the same task side by side — revealed four distinct failure patterns:

| Pattern | Frequency | Description | Challenge |
|---|---|---|---|
| Premature convergence | High | Model locks into a wrong approach early and never backtracks | Moderate |
| Semantic reasoning error | Medium | Model misreads task requirements | Moderate |
| Hidden eval constraints | Medium | Task has implicit requirements not in the description | Low |
| Algorithm stochasticity | High | Equally plausible approaches differ in robustness | High |

These failure patterns help guide our next-stage experiment design for harness improvement (i.e., workflow, prompt, memory, tool, and skill).


4. Exploring Improvement Directions (Group C)

With the baseline, north star, and flaky tasks identified, we systematically explored four workspace components: tools, memory, system prompt, and skills.

4.1 System Prompt: Behavioral Strategy Evolution (C1)

We restricted the evolver to modifying only the system prompt to address the failure cases in B3, e.g., adding behavioral strategy rules like “Comprehensively understand the task and think in various directions before starting to solve the problem.”

| Experiment | Approach | Score | Prompt Length |
|---|---|---|---|
| C1a | Prompt-only evolution with lightweight guidance | 66.02% | 1,368 -> 1,940 chars |
| C1b | Prompt-only evolution with tailored guidance from B3 | 60.13% | 1,368 -> 4,940 chars |

Finding: C1a scored essentially the same as vanilla A2 (66.02% vs 65.91%), while the longer, more complex prompt hurt in C1b. The strategy rules are qualitatively sensible but quantitatively inert: Opus 4.6 already implements these strategies implicitly.

4.2 Memory: Episodic Learning Across Tasks (C2)

We tested memory injection: episodic observations from previous tasks injected into future tasks via full loading (always included in context). Memories accumulated across evolution cycles without providing task-relevant information.

| Experiment | Approach | Score | Memory Length |
|---|---|---|---|
| C2a | Memory-only evolution with memory budgets | 65.11% | 1,431 chars |
| C2b | Memory-only evolution without limitations | 60.46% | 2,770 chars |

Finding: Memory injection was consistently neutral or slightly negative. The evolver stored generic process observations (“always submit your answer,” “save intermediate results”) — advice the model already knows.

4.3 Tools: Universal vs Task-Specific (C3)

We tested the impact of adding pre-defined tools and self-generated tools. The pre-defined tool set is Bash, Python, and Submit. In the self-evolving stage, the evolver agent generates task-specific tools from the task trajectories.

| Experiment | Approach | Score | Notes |
|---|---|---|---|
| C3a | Bash + Python | 58.12% | |
| C3b | Bash + Python + Submit | 65.91% | Same setting as A2 |
| C3c | Bash + Python + Submit + self-generated tools (e.g., SecurityFileCheck(), Verify()) | 59.37% | |

Finding: (1) The Submit tool helps because the agent otherwise does not know when to stop: in the C3a trajectories, the agent sometimes continues solving the problem without submitting its results. (2) Self-generated task-specific tools hurt on terminal/coding tasks; similar coding tasks were heavily trained during post-training using only universal tools (e.g., grep, bash). (3) Verification logic needs to be richer than what a tool interface naturally supports: the agent often called verify() but performed only superficial checks.

4.4 Skills: The Most Promising Direction (C4)

We examine the influence of self-generated skills during evolution. Skills serve as domain-specific knowledge and provide guidance when the agent solves similar tasks. Skills are loaded lazily.

| Experiment | Approach | Score | Skill Length |
|---|---|---|---|
| C4a | Skill-only evolution with skill count budgets | 70.58% | 3,012 chars |
| C4b | Skill-only evolution without budgets | 68.23% | 8,122 chars |

Finding: (1) In C4a, after the evolver generated 3 seed skills, the agent chose to read at least one skill on 32.94% of the remaining tasks. (2) Even with more, longer skills in C4b, we still observe a consistent gain. This indicates that well-targeted skills deliver a measurable improvement, unlike tool, memory, or prompt changes.

4.5 Auto-Harness: The combination of prompt, tool, memory, and skill (C5)

Furthermore, we test combinations of harness components. Based on experiments C1–C4, we know skills help most and that lazy loading does not hurt performance much. On top of skill evolution, we explore different harness combinations in depth.

| Experiment | Approach | Score |
|---|---|---|
| C5a | Skill + Prompt | 67.11% |
| C5b | Skill + Tool | 66.23% |
| C5c | Skill + Memory | 67.02% |
| C5d | Skill + Prompt + Tool | 69.13% |
| C5e | Skill + Prompt + Memory | 67.29% |
| C5f | Skill + Tool + Memory | 64.19% |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |

Finding: Across the C5 series, adding prompt, tool, or memory evolution on top of skills hurts, with the exception of C5d. This indicates that for Terminal-Bench 2.0 we should focus on skill evolution to approach the north star established in B1.

5. Skill Evolution Strategies (Group D)

Skills employ a lazy-loading strategy: we list only skill names and one-line descriptions in the system prompt, and the agent calls read_skill(name) to load full content on demand.
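A minimal sketch of this lazy-loading registry is below; the `Skill` structure and class names are illustrative assumptions, but the contract matches the text: only names and one-line summaries reach the system prompt, and `read_skill(name)` returns the full body on demand.

```python
# Sketch of lazy skill loading: summaries in the prompt, bodies on demand.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    summary: str   # one line, listed in the system prompt
    body: str      # full content, loaded only via read_skill

class SkillLibrary:
    def __init__(self, skills):
        self._skills = {s.name: s for s in skills}

    def prompt_index(self) -> str:
        # What the solver sees up front: names + one-line descriptions only.
        return "\n".join(f"- {s.name}: {s.summary}"
                         for s in self._skills.values())

    def read_skill(self, name: str) -> str:
        # The tool the solver calls when it decides a skill is relevant.
        return self._skills[name].body
```

The key property is that skill bodies never inflate the system prompt; the agent pays the context cost only for skills it chooses to read.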

5.1 Skill Size Optimization (D1)

We examine the impact of skill size:

| Experiment | Skill Size | Score |
|---|---|---|
| D1a | 0–500 chars | 67.21% |
| D1b | 500–1,000 chars | 66.73% |
| D1c | 1,000–1,500 chars | 68.91% |
| D1d | 1,500–2,000 chars | 71.23% |
| D1e | 2,000–2,500 chars | 70.02% |
| D1f | 2,500–3,000 chars | 68.43% |
| D1g | 3,000–3,500 chars | 67.19% |

Finding: Skill size exhibits a clear sweet spot at 1,500–2,000 characters (D1d, 71.23%). Skills shorter than 1,000 characters lack sufficient domain detail to be useful, while skills exceeding 2,500 characters dilute the signal with redundant content. The drop-off is roughly symmetric: -4.02pp below the sweet spot (D1a) and -4.04pp above it (D1g). This guides our skill creation constraints in subsequent experiments.

5.2 Skill count budget and Per-Task Keyword Selection (D2)

We investigate the number of skills added. We also test a mitigation for the unbounded setting: match skill keywords against the task description and inject only the skills with top-k relevance scores, ranked by n-gram overlap.

| Experiment | Skill Delivery | Score |
|---|---|---|
| D2a | With skill count budget of 3 | 70.58% |
| D2b | With skill count budget of 6 | 72.21% |
| D2c | Without skill count budget | 68.23% |
| D2d | Without skill count budget but with per-task keyword selection | 71.94% |

Finding: A moderate skill budget of 6 (D2b) achieves the best score at 72.21%, outperforming both the tighter budget of 3 (D2a, 70.58%, with ratio of tasks using skills as 32.94%) and the unbounded setting (D2c, 68.23%). Unbounded skill injection degrades performance because the agent loads irrelevant skills that distract from the task. Per-task keyword selection (D2d, 71.94%) partially mitigates the unbounded problem by filtering skills for relevance, recovering most of the gain without a hard count limit.
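The per-task keyword selection of D2d can be sketched as follows. The report only specifies "top-k relevance score ranked by N-gram overlapping," so the unigram-plus-bigram overlap score below is an assumed concrete instantiation, not the exact production scorer.

```python
# Sketch of per-task keyword selection (D2d): rank skills by word n-gram
# overlap between skill keywords and the task description; inject top-k.

def ngrams(text: str, n: int) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def relevance(task: str, keywords: str) -> int:
    # Count shared unigrams and bigrams (an illustrative choice of n).
    return sum(len(ngrams(task, n) & ngrams(keywords, n)) for n in (1, 2))

def select_skills(task: str, skills: dict, k: int = 6) -> list:
    # skills maps skill name -> keyword string
    ranked = sorted(skills, key=lambda s: relevance(task, skills[s]),
                    reverse=True)
    return [s for s in ranked[:k] if relevance(task, skills[s]) > 0]
```

Skills with zero overlap are dropped entirely, which is what lets this recover most of the budgeted setting's gain without a hard count limit.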

5.3 Solver-Proposed Skills (D3)

We tested having the solver (with full environment context) propose skill drafts after each task, with the evolver judging the skills and curating proposals.

| Experiment | Skill Generation | Score |
|---|---|---|
| D3a | Evolver creates from trajectories | 72.21% |
| D3b | Solver creates, evolver refines | 67.35% |
| D3c | Solver creates, evolver generalizes | 69.16% |
| D3d | Evolver creates, then judges and refines | 70.37% |

Finding: “Solver creates, evolver refines” (D3b) has the lowest score. The solver excels at extracting task-specific knowledge but fails at generalization. The evolver merged hyper-specific drafts into existing skills, replacing useful content with narrow tips like “ICS calendar parsing” and “OCR configuration.” “Solver creates, evolver generalizes” (D3c) partially recovered by instructing the evolver to “never shrink” existing skills and to generalize across them. However, the best result still comes from the evolver creating skills directly from trajectories (D3a). Lesson: The bottleneck is not knowledge extraction but knowledge generalization — the solver cannot reliably identify what transfers to unseen tasks.

5.4 Case study: The Process-Advice Trap (D4)

When we analyzed the long-horizon trajectory and read the skill content in D1-D3 experiments, we found that the evolver gravitates toward generic process advice. Without constraints, it produces skills like “always check if packages are installed” and “handle timeouts by trying alternatives.” We introduced a FORBIDDEN/REQUIRED blocklist as follows:

FORBIDDEN: Timeout handling, pip tips, session persistence, generic debugging
REQUIRED: Domain-specific libraries, verification steps, domain pitfalls ONLY

| Experiment | Skill Guidance | Score |
|---|---|---|
| D4a | Generic skill creation guidance in evolver agent | 72.21% |
| D4b | + negative prompting | 73.17% |
| D4c | + positive steering | 70.36% |
| D4d | + negative prompting + positive steering | 73.21% |

Finding: Negative prompting (D4b, blocking bad outputs) was more effective than positive steering (D4c). The combination (D4d) achieves the highest score but only marginally above negative prompting alone (+0.04pp), suggesting most of the gain comes from the blocklist. Without the blocklist, skills accumulate generic process advice; with it, they stay focused on domain-specific knowledge.
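In our setup the blocklist lives in the evolver's prompt, but the same idea can be enforced mechanically as a post-hoc gate on skill drafts. The keyword lists below are illustrative stand-ins for the FORBIDDEN/REQUIRED lists above, not a faithful reproduction.

```python
# Sketch: reject evolver skill drafts that contain FORBIDDEN process-advice
# topics, and accept only drafts with some REQUIRED domain signal.
FORBIDDEN = ("timeout", "pip install tip", "session persistence",
             "generic debugging")
REQUIRED = ("library", "verify", "pitfall")

def accept_draft(draft: str) -> bool:
    text = draft.lower()
    if any(bad in text for bad in FORBIDDEN):
        return False                      # generic process advice: block
    return any(good in text for good in REQUIRED)  # demand domain content
```

A filter like this complements negative prompting: even if the evolver ignores the instruction, drafts that drift back into generic advice never reach the skill library.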

5.5 Case Study: Skill Evolution Non-Determinism (D5)

Similar to the non-determinism in LLM response generation and agent execution, we examine non-determinism in skill evolution: we run five evolution runs from an identical seed. The results are shown below.

| Experiment | Example debugging skill content | Example scientific-computing skill content | Score |
|---|---|---|---|
| D5a | Protobuf API migration | ML GPU detection | 69.65% |
| D5b | ICS file writing | OCR/tesseract | 71.30% |
| D5c | Constraint solvers | KL divergence | 72.21% |
| D5d | C/C++ memory debugging (valgrind) | Calendar scheduling | 73.12% |
| D5e | Async cancellation | Logic circuits | 74.77% |

Finding: Five evolution runs from identical seeds produced substantially different skill content — each evolver wrote a different “textbook” based on which tasks happened to fail in each batch. Despite the content divergence, scores ranged from 69.65% to 74.77%, indicating that multiple distinct skill sets can be effective. This implies that evolved skills are samples from a large space of possibly-useful knowledge, heavily path-dependent on batch ordering and stochastic task outcomes. Ensembling diverse skill sets may be a promising future direction.


6. Self-Verification: Breaking the Ceiling (Group E)

For motivation, our error analysis (Section 3.3) revealed that many flaky task failures stem from unchecked assumptions rather than missing knowledge. The agent solves the task correctly most of the time but occasionally submits without verifying a critical decision — wrong normalization, broken package manager, overfitting to the example.

6.1 Ablation Study: Various Verification Approaches (E1)

We tested five verification delivery mechanisms on the 21 flaky tasks (Opus 4.6, no evolution):

| Experiment | Approach | Score | Change | Avg Turns | Total Tokens |
|---|---|---|---|---|---|
| E1a | Baseline (no verification) | 33.33% | | 77.8 | 24.34M |
| E1b | Two-step verification-submit workflow | 42.86% | +9.52pp | 78.1 | 25.34M |
| E1c | Optional verify() tool | 47.62% | +14.29pp | 91.0 | 46.46M |
| E1d | Self-verification skill | 66.67% | +33.33pp | 82.0 | 32.29M |
| E1e | Enhanced system prompt | 52.38% | +19.05pp | 69.5 | 18.87M |
| E1f | Skill + prompt combo | 42.86% | +9.52pp | 71.2 | 22.05M |

The self-verification skill (E1d) was the clear winner. Critically, the combination of skill + prompt (E1f) performed worse than either alone — a counterintuitive result with an important explanation.

Why the Combination Hurts

The skill works because it’s lazy-loaded: the model reads it only when uncertain. The prompt instruction forces verification on every task, including those where the model is already confident. When both are present:

  1. The prompt forces a verification step the model wouldn’t naturally take
  2. The model also loads the skill (because it’s now in verification mode)
  3. Redundant double-verification wastes 2-3 turns per task
  4. The aggregate overhead costs more tasks than extra checking saves

Lesson: Let the model self-select when to verify rather than forcing it universally.

6.2 Scaling to Full Benchmark: The Initial Disappointment (E2)

The ablation showed the gain from the self-verification skill on the 21 flaky tasks. We then ran the same configuration on all 89 tasks with various self-verification skills.

| Experiment | Skill Setup | Score |
|---|---|---|
| E2a | No predefined seed skills | 72.21% |
| E2b | Generic self-verification skill as seed skill | 70.16% |
| E2c | Curated self-verification skill as seed skill | 74.27% |

Finding: Without any predefined skills (E2a), the baseline already achieves 72.21%. Adding a generic self-verification skill (E2b) actually hurts by 2.05pp — the skill contained advice like “check edge cases,” “test empty input,” “verify permissions,” which is knowledge the model already has (the same process-advice trap from Section 5.4). Only when we curated the skill with targeted assumption-checking content (E2c) did verification deliver a +2.06pp gain. This confirms that what the skill says matters far more than that a skill exists.

Guided by failure pattern analysis (Section 3.3), we replaced the generic advice with targeted assumption-checking:

Before (generic):

## 3. Edge Cases
- Check with empty input, special characters, large input
- Verify error handling works
- Check permissions (chmod, ownership)

After (targeted):

## 3. Check Your Assumptions
- If you chose between multiple approaches (e.g., normalization methods),
  verify your choice matches what the task/test expects
- If you installed or modified system packages, verify the package manager
  still works
- If your solution works on the provided example, consider whether it
  generalizes to different inputs

Each bullet addresses a real failure mode observed in Section 3.3.

We also enhanced the server-verification guidance: “Use curl or wget to test endpoints — verify response content, not just that the port is open. Checking ports with ss or netstat is NOT enough.” This addresses install-windows-3.11, where the agent verified port 80 was open but never tested the nginx content.

6.3 The Nudge: Directing the Model to Read the Skill (E3)

With improved content, we added a single line to the system prompt: “After you think you have completed the task, read the self-verification skill to verify your solution.”

| Experiment | Nudge | Skill Content | Skill Loads | Score |
|---|---|---|---|---|
| E3a | None | Targeted | 24.71% | 74.27% |
| E3b | Read-skill nudge | Generic | 77.65% | 68.24% |
| E3c | Read-skill nudge | Targeted | 90.59% | 76.47% |

Finding: The system prompt nudge dramatically increases skill loading frequency — from 24.71% (E3a) to 77.65%–90.59% (E3b/E3c). However, loading frequency alone does not guarantee improvement: E3b loads the skill on 77.65% of tasks yet drops performance by 6.03pp because the generic content wastes turns on checks the model already performs. E3c, with targeted content and 90.59% loading, achieves the best result at 76.47%. Lesson: Content quality matters more than loading frequency. The improvement comes from making each verification round productive (catches real bugs) rather than wasteful (checks things the model already knows).


7. Results Summary

7.1 Full Score Progression

Below we present all experiments organized by group. Detailed analysis for each experiment is in its respective section.

Group A: Baselines (Section 2)

| Experiment | Description | Score |
|---|---|---|
| A1 | Opus 4.5 vanilla ReAct agent | 60.15% |
| A2 | Opus 4.6 vanilla ReAct agent | 65.91% |

Group B: Ceiling Analysis (Section 3)

| Experiment | Description | Score |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs | 85.88% |
| B1b | 2× per-task timeout | 68.12% |
| B1c | Environment signal (artifact vs ground-truth) | 66.23% |

Group C: Harness Component Exploration (Section 4)

| Experiment | Description | Score |
|---|---|---|
| C1a | Prompt-only evolution (lightweight) | 66.02% |
| C1b | Prompt-only evolution (tailored from B3) | 60.13% |
| C2a | Memory-only evolution (with budgets) | 65.11% |
| C2b | Memory-only evolution (unbounded) | 60.46% |
| C3a | Bash + Python (no Submit tool) | 58.12% |
| C3b | Bash + Python + Submit (= A2) | 65.91% |
| C3c | + Self-generated tools | 59.37% |
| C4a | Skill-only evolution (budgeted) | 70.58% |
| C4b | Skill-only evolution (unbounded) | 68.23% |
| C5a | Skill + Prompt | 67.11% |
| C5b | Skill + Tool | 66.23% |
| C5c | Skill + Memory | 67.02% |
| C5d | Skill + Prompt + Tool | 69.13% |
| C5e | Skill + Prompt + Memory | 67.29% |
| C5f | Skill + Tool + Memory | 64.19% |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |

Group D: Skill Evolution Strategies (Section 5)

| Experiment | Description | Score |
|---|---|---|
| D1a | Skill size 0–500 chars | 67.21% |
| D1b | Skill size 500–1,000 chars | 66.73% |
| D1c | Skill size 1,000–1,500 chars | 68.91% |
| D1d | Skill size 1,500–2,000 chars | 71.23% |
| D1e | Skill size 2,000–2,500 chars | 70.02% |
| D1f | Skill size 2,500–3,000 chars | 68.43% |
| D1g | Skill size 3,000–3,500 chars | 67.19% |
| D2a | Skill count budget = 3 | 70.58% |
| D2b | Skill count budget = 6 | 72.21% |
| D2c | No skill count budget | 68.23% |
| D2d | No budget + per-task keyword selection | 71.94% |
| D3a | Evolver creates from trajectories | 72.21% |
| D3b | Solver creates, evolver refines | 67.35% |
| D3c | Solver creates, evolver generalizes | 69.16% |
| D3d | Evolver creates and judges/refines | 70.37% |
| D4a | Generic skill creation guidance | 72.21% |
| D4b | + FORBIDDEN blocklist (negative prompting) | 73.17% |
| D4c | + positive steering | 70.36% |
| D4d | + negative prompting + positive steering | 73.21% |
| D5a–D5e | 5 runs from identical seed (skill variance) | 69.65%–74.77% |

Group E: Self-Verification (Section 6)

| Experiment | Description | Score | Note |
|---|---|---|---|
| E1a | Baseline — no verification (21 flaky tasks) | 33.33% | Flaky subset only |
| E1b | Two-step verification-submit workflow | 42.86% | Flaky subset only |
| E1c | Optional verify() tool | 47.62% | Flaky subset only |
| E1d | Self-verification skill | 66.67% | Flaky subset only |
| E1e | Enhanced system prompt | 52.38% | Flaky subset only |
| E1f | Skill + prompt combo | 42.86% | Flaky subset only |
| E2a | No predefined seed skills (full benchmark) | 72.21% | |
| E2b | Generic self-verification skill | 70.16% | |
| E2c | Curated self-verification skill | 74.27% | |
| E3a | No nudge, targeted skill | 74.27% | 24.71% skill load rate |
| E3b | Nudge + generic skill | 68.24% | 77.65% skill load rate |
| E3c | Nudge + targeted skill | 76.47% | 90.59% skill load rate |

Key Progression Summary

The path from baseline to best result:

| Stage | Key Innovation | Score | Δ from A2 |
|---|---|---|---|
| A1 | Opus 4.5 baseline | 60.15% | |
| A2 | Opus 4.6 baseline | 65.91% | |
| C4a | Skill-only evolution (budgeted) | 70.58% | +4.67pp |
| D2b | Skill count budget = 6 | 72.21% | +6.30pp |
| D4d | FORBIDDEN blocklist | 73.21% | +7.30pp |
| E2c | Curated verification skill | 74.27% | +8.36pp |
| E3c | Targeted verification + nudge | 76.47% | +10.56pp |

7.2 Leaderboard

| System | Score | Notes |
|---|---|---|
| Opus 4.6 (A2) | 65.91% | |
| Copy (Top #7) | 75.3% | |
| A-Evolve (E3c) | 76.47% | Ours |
| ForgeCode (Top #1) | 81.8% | |

8. Key Learnings

What Works

  1. Lazy skill loading. Let the model decide which skills to read. Without nudging, the model self-selects on ~25–33% of tasks (E3a, C4a); with a targeted nudge, up to 91% (E3c). System prompt stays at ~1,900 characters instead of 15,000+.

  2. Targeted self-verification. A “Check Your Assumptions” skill that verifies decision choices (normalization method, algorithm selection, package state) rather than generic edge cases. Content quality drives the improvement — generic verification hurts (E2b, -2.05pp) while targeted verification helps (E3c, +10.56pp).

  3. FORBIDDEN blocklist. Blocking generic process advice forces domain-specific skill creation (D4b, +0.96pp over unblocked D4a). Negative prompting outperforms positive steering (D4b vs D4c).

  4. Minimal prompt. The 1,368-character base prompt is near-optimal (C1b shows longer prompts hurt). Total budget including nudges: ~1,485 characters. Every addition must be weighed against attention dilution.

What Doesn’t Work

  1. Solver-proposed skills. The solver extracts task-specific knowledge but fails at generalization (D3b: -4.86pp vs D3a).

  2. Combining verification approaches. Skill + prompt performs worse than either alone (E1f: +9.52pp vs E1d: +33.33pp on flaky tasks).

  3. Meta-skills. A “systematic-exploration” skill was loaded on only 1.18% of tasks. The model does not self-assess when it is stuck.

  4. Unbounded skill injection. Removing skill count budgets degrades performance (D2c: 68.23% vs D2b: 72.21%) even with lazy loading, because the agent loads irrelevant skills.

Design Insights

  1. Error analysis drives improvement. The +9.41pp gain came from trajectory analysis → failure pattern identification → targeted skill content. Human-in-the-loop analysis outperformed all automated evolution.

  2. Simple scaffold wins. Best configuration: minimal prompt (1.5K) + 5 focused skills (8.5K) + lazy loading + no evolution.


9. Future Directions

  1. Richer evolver signal. The trajectory-only evolver sees compressed logs without pass/fail. Richer feedback (judge score breakdowns, failure localization) could improve skill quality.
  2. Multi-agent evolver. The current single-agent evolver makes path-dependent decisions (D5) and must handle analysis, ideation, mutation, and quality control in one context window. A multi-agent evolver architecture could decompose these responsibilities: an orchestrator agent coordinates the evolution cycle; an evolver sub-agent mutates workspace files (the current role); an experiment analysis sub-agent performs deep trajectory and metric analysis to surface failure patterns; a direction exploration sub-agent proposes and prioritizes improvement hypotheses; and a code review sub-agent validates mutations for correctness and consistency before they are committed. This separation of concerns mirrors how human research teams operate and could reduce the variance and error accumulation observed in single-agent evolution.