Designing Self-evolving Coding Agents for Terminal-Bench 2.0: A Systematic Exploration
Abstract. We present a systematic exploration of designing self-evolving coding agents for Terminal-Bench 2.0, a benchmark of 89 diverse command-line tasks. Starting from a vanilla ReAct agent powered by Claude Opus 4.6 (65.91%), we conduct large-scale controlled experiments across six groups: establishing baselines, analyzing the performance ceiling (pass@5: 85.88%), exploring four harness components (prompt, memory, tools, skills), refining skill evolution strategies, developing self-verification mechanisms, and decomposing evolution into a multi-agent system (MAS-Evolver). Our investigation reveals that among all mutable components, skill evolution delivers the most consistent improvement, and a nudged self-verification mechanism further boosts performance. MAS-Evolver validates that separating analysis, authoring, and adversarial review into specialized agents produces quality-gated skills (+7.03pp over vanilla). Altogether, our best self-evolution algorithm achieves 76.47%, a +10.56pp improvement over the Opus 4.6 vanilla baseline, ranking #7 on the Terminal-Bench 2.0 leaderboard.
1. Introduction
1.1 Terminal-Bench 2.0
Terminal-Bench 2.0 (TB2) is a benchmark of 89 diverse command-line tasks spanning system administration, scientific computing, security, data science, and software engineering (e.g., configuring web servers, analyzing Bayesian networks, training ML models, sanitizing git history, and compiling software from source). Each task provides a Docker container with a pre-configured environment, a natural language task description, and a hidden test script that validates the solution. Task difficulty ranges widely, from 5-minute file edits to 60-minute multi-step system configurations, with per-task timeouts.
1.2 Agent Architecture
Our solver agent (i.e., the coding agent) employs two agent frameworks: (1) Strands Agents + AgentCore; (2) a standard ReAct loop + Sandbox/Docker. The solver has access to three core tools:
- bash(): Execute shell commands in the Docker container (60s timeout per call)
- python(): Execute Python code in an isolated interpreter
- submit("DONE"): Signal task completion
The system prompt follows an Analysis/Plan format that structures the agent’s reasoning before each tool call.
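The three core tools can be sketched as follows. This is a minimal illustration assuming subprocess-based execution; the helper bodies are ours, not the harness's actual implementation:

```python
import subprocess

def bash(command: str, timeout: int = 60) -> str:
    """Run a shell command in the task container (60s timeout per call)."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return f"[timeout after {timeout}s]"

def python(code: str) -> str:
    """Execute Python code in a separate, isolated interpreter process."""
    return bash(f"python3 -c {code!r}", timeout=60)

def submit(answer: str = "DONE") -> str:
    """Signal task completion to the harness."""
    return f"[submitted: {answer}]"
```

In the real agent these are exposed to the model as tool schemas; here they are plain functions for clarity.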
1.3 The Evolution Framework (A-Evolve)
A-Evolve deploys a trajectory-only, in-situ evolution loop as follows:
Solver Agent (bash/python/submit tools)
↓ solves tasks in Docker containers
Trajectory Observer (JSONL logs)
↓ records tool calls, outputs, ...
LLM Judge (scores 0-10 without ground-truth)
↓ evaluates trajectories/identify potential failure patterns/...
Evolver Agent (LLM agent accessing bash tool to modify the agent harness)
↓ mutates prompts/skills/tools/memories based on observations
Skill Library (lazy-loading)/Prompt/Tool Registry/Memory
↓ agent reads skills on demand/loads updated prompts/...
Next Batch's Solver
The key design constraint is that evolution operates without ground-truth feedback. The evolver never sees test results — it must infer what went wrong from the trajectory alone. This makes the problem significantly harder than test-guided optimization.
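The cycle above can be summarized in a short skeleton. The function names and data shapes are illustrative, not the actual harness API; the essential property is that no ground-truth test result ever reaches the evolver:

```python
# One A-Evolve cycle: solve -> observe -> judge -> evolve (no ground truth).

def run_evolution_cycle(tasks, harness, solve, judge, evolve):
    trajectories = []
    for task in tasks:
        # Solver works in the task container; the observer logs every step
        # as a list of (tool_call, output) records.
        trajectories.append(solve(task, harness))

    # LLM judge scores 0-10 and flags suspected failure patterns using
    # only the trajectory, never the hidden test script.
    reviews = [judge(t) for t in trajectories]

    # Evolver mutates the mutable harness parts (skills/prompt/tools/memory)
    # for the next batch's solver.
    return evolve(harness, trajectories, reviews)
```

Usage amounts to calling `run_evolution_cycle` once per batch and feeding the returned harness into the next batch's solver.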
1.4 Experiment Metrics
For the solver agent, we measure: (1) pass rate (resolved/total, %); (2) average turns; (3) average tool calls; (4) average number of skills used; (5) running time. For the evolver agent, we measure: (1) the number of created skills/tools/memories; (2) the number of characters changed in the prompt/memory.
1.5 Experiment Notation
We conduct our experiments in a principled way to systematically build a self-evolving coding agent, organized into the following chronologically thematic groups. Each group is labeled with a letter prefix, and individual experiments within a group are numbered sequentially:
| Group | Focus | Experiments | Section |
|---|---|---|---|
| A | Baseline establishment | A1–A2 | Section 2 |
| B | Ceiling analysis and task classification | B1a–B1c, B2, B3 | Section 3 |
| C | Harness component exploration (prompt, memory, tools, skills, combinations) | C1a–C1b, C2a–C2b, C3a–C3c, C4a–C4b, C5a–C5g | Section 4 |
| D | Focused skill evolution strategies (size, count, generation, filtering, variance) | D1a–D1g, D2a–D2d, D3a–D3d, D4a–D4d, D5a–D5e | Section 5 |
| E | Self-verification test-time scaling | E1a–E1f, E2a–E2c, E3a–E3c | Section 6 |
| F | Multi-Agent System (MAS) Evolver | F1a–F1d, F2a–F2c | Section 7 |
1.6 Infrastructure: API reliability and context management
Our agent infrastructure addresses two key challenges: API reliability and context management. For API reliability, running 20 concurrent AWS Bedrock API calls requires robust error handling — we implement transient error retries (covering read timeouts, content filtering false positives, and connection hangs), timeout clock pausing during retry waits so that idle retry time does not consume the task’s budget, and a wall-clock safety cap to prevent infinite retry loops on persistent outages. For context management, we employ a summarization-based approach: when the conversation reaches the context window limit, we summarize the preceding conversation history into a condensed representation and continue from that summary, preserving critical state while freeing context capacity for continued reasoning.
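The retry-with-budget-pausing logic can be sketched as below. The class and function names are ours, and the backoff schedule is illustrative; the key ideas from the text are that retry waits do not consume the task's time budget, and a wall-clock cap bounds total retrying:

```python
import time

class PausableBudget:
    """Task-time budget that can be paused while waiting on API retries."""
    def __init__(self, budget_s: float):
        self.remaining = budget_s
        self._started = None
    def start(self):
        self._started = time.monotonic()
    def pause(self):
        # Charge only the time spent actually working, not retry waits.
        self.remaining -= time.monotonic() - self._started
        self._started = None

def call_with_retry(fn, budget, max_wall_s=600, waits=(1, 2, 4, 8)):
    deadline = time.monotonic() + max_wall_s      # hard cap on total retrying
    for wait in waits:
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if time.monotonic() + wait > deadline:
                raise                             # persistent outage: give up
            budget.pause()                        # retry wait is "free" time
            time.sleep(wait)
            budget.start()
    return fn()                                   # final attempt, no catch
```

The caller starts the budget before invoking the model and the wrapper pauses it only while backing off, so transient API flakiness does not eat into a task's timeout.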
2. Establishing the Baseline (Group A)
2.1 Opus 4.5 Baseline (A1)
Our initial experiments used Claude Opus 4.5. The vanilla ReAct coding agent scored an average of 60.15% across 5 runs. (By default, all results are averaged over 5 runs; some experiments use 3 runs due to API quota limits.)
| Metric | A1: Opus 4.5 |
|---|---|
| Score | AVG 60.15% |
| Mean turns (all tasks) | 72.84 |
| Mean input tokens/task | 1,081,840 |
| Mean output tokens/task | 15,689 |
| Prompt length | 1,368 chars |
| Skills loaded | 0 (no loading mechanism) |
2.2 Opus 4.6 Baseline (A2)
When Claude Opus 4.6 was released (February 5, 2026), we used the same ReAct workflow and observed a significant capability jump:
| Metric | A1: Opus 4.5 | A2: Opus 4.6 | Change |
|---|---|---|---|
| Score | AVG 60.15% | AVG 65.91% | +5.76pp |
| Mean turns (all) | 72.84 | 60.68 | -16.70% |
| Mean input tokens/task | 1,081,840 | 670,784 | -38.01% |
| Mean output tokens/task | 15,689 | 13,910 | -11.34% |
| Prompt length | 1,368 chars | 1,368 chars | Same |
| Skills loaded | 0 | 0 | Same |
Finding: Opus 4.6 solves tasks with 38.01% fewer input tokens and 16.70% fewer turns while achieving a higher pass rate; the model is substantially more efficient and capable. In particular, we find that Opus 4.6 often self-verifies its results without being prompted.
3. Ceiling Analysis and Task Classification (Group B)
Before attempting to improve performance, we needed to (1) estimate the performance upper bound, by computing pass@5 from A2 (B1a), increasing the per-task time limit (B1b), and providing an environment signal in the ReAct loop that compares the generated artifact with the ground-truth result (B1c); and (2) understand the benchmark's structure: which tasks are deterministically solvable, which are beyond the model's capability, and which are stochastic. We then focus on the stochastic, flaky tasks for improvement.
3.1 Ceiling Analysis (B1)
| Experiment | Design details | Performance |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs (upper bound of flaky stabilization) | 85.88% |
| B1b | 2× per-task timeout (1,800s–7,200s) | 68.12% |
| B1c | Environment signal: compare generated artifact against ground-truth output in the ReAct loop | 66.23% |
Finding: B1a establishes the north star for Terminal-Bench 2.0 (85.88%) that guides our exploration. B1c shows that environment signals do not help much on these coding tasks. B1b shows that doubling the time limit helps by ~2pp, indicating that TB2 is a resource-heavy benchmark.
3.2 Task Stability Classification (B2)
Across 5 independent Opus 4.6 runs, we classified all 89 tasks by their pass rate:
| Category | Count | Pass Rate | Description |
|---|---|---|---|
| Always pass | 52 | 100.00% | Deterministic — model capability sufficient |
| Flaky (unstable) | 11 | 40–80% (pass 2–4 of 5) | Likely solvable with small improvements |
| Flaky (rare pass) | 10 | 20% (pass 1 of 5) | Hard but occasionally solvable |
| Always fail | 16 | 0.00% | Beyond current capability |
Finding for the theoretical ceiling: If all flaky tasks could be stabilized, the score would reach 85.88%. The 16 always-fail tasks represent a hard capability boundary.
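The classification and ceiling computation are simple bookkeeping over per-task pass counts. A sketch with toy data (real runs cover 89 tasks; function names are ours):

```python
# Classify tasks by pass count across 5 runs (B2) and derive the
# flaky-stabilized ceiling, i.e., pass@5 (B1a).

def classify(pass_counts, runs=5):
    buckets = {"always_pass": 0, "flaky_unstable": 0,
               "flaky_rare": 0, "always_fail": 0}
    for n in pass_counts.values():
        if n == runs:
            buckets["always_pass"] += 1       # deterministic successes
        elif 2 <= n < runs:
            buckets["flaky_unstable"] += 1    # pass 2-4 of 5
        elif n == 1:
            buckets["flaky_rare"] += 1        # pass 1 of 5
        else:
            buckets["always_fail"] += 1       # beyond current capability
    return buckets

def pass_at_k(pass_counts):
    """Fraction of tasks solved in at least one run (the theoretical ceiling)."""
    return sum(n > 0 for n in pass_counts.values()) / len(pass_counts)

counts = {"t1": 5, "t2": 3, "t3": 1, "t4": 0}
```

On the toy data each bucket gets one task and the ceiling is 0.75; on the real 89-task data this computation yields the 85.88% figure.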
3.3 Failure Pattern Taxonomy (B3)
Deep trajectory analysis, comparing successful and failed runs of the same task side by side, revealed four distinct failure patterns:
| Pattern | Frequency | Description | Challenge |
|---|---|---|---|
| Premature convergence | High | Model locks into a wrong approach early and never backtracks | Moderate |
| Semantic reasoning error | Medium | Model misreads task requirements | Moderate |
| Hidden eval constraints | Medium | Task has implicit requirements not in the description | Low |
| Algorithm stochasticity | High | Equally plausible approaches differ in robustness | High |
These failure patterns help guide our next-stage experiment design for harness improvement (i.e., workflow, prompt, memory, tool, and skill).
4. Exploring Improvement Directions (Group C)
With the baseline, north star, and flaky tasks identified, we systematically explored four workspace components: tools, memory, system prompt, and skills.
4.1 System Prompt: Behavioral Strategy Evolution (C1)
We restricted the evolver to modifying only the system prompt to address the failure cases in B3, e.g., adding behavioral strategy rules like "Comprehensively understand the task and think in various directions before starting to solve the problem."
| Experiment | Approach | Score | Prompt Length |
|---|---|---|---|
| C1a | Prompt-only evolution with lightweight guidance | 66.02% | 1,368 -> 1,940 chars |
| C1b | Prompt-only evolution with tailored guidance from B3 | 60.13% | 1,368 -> 4,940 chars |
Finding: C1a scored essentially the same as vanilla A2, while the longer, more tailored prompt hurt in C1b. The strategy rules are qualitatively sensible but quantitatively inert: Opus 4.6 already implements these strategies implicitly.
4.2 Memory: Episodic Learning Across Tasks (C2)
We tested memory injection: episodic observations recorded from previous tasks and fully loaded into the context for future tasks. Memories accumulated across evolution cycles without providing task-relevant information.
| Experiment | Approach | Score | Memory Length |
|---|---|---|---|
| C2a | Memory-only evolution with memory budgets | 65.11% | 1,431 chars |
| C2b | Memory-only evolution without limitations | 60.46% | 2,770 chars |
Finding: Memory injection was consistently neutral or slightly negative. The evolver stored generic process observations (“always submit your answer,” “save intermediate results”) — advice the model already knows.
4.3 Tools: Universal vs Task-Specific (C3)
We tested the impact of adding pre-defined tools and self-generated tools. The pre-defined tool set is Bash, Python, and Submit. During self-evolution, the evolver agent generates task-specific tools from task trajectories.
| Experiment | Approach | Score | Notes |
|---|---|---|---|
| C3a | Bash + Python | 58.12% | |
| C3b | Bash + Python + Submit | 65.91% | the same setting as A2 |
| C3c | Bash + Python + Submit + Self-generated tools (e.g., SecurityFileCheck(), Verify()) | 59.37% |
Finding: (1) The Submit tool helps because the agent otherwise does not know when to stop: in C3a trajectories, the agent sometimes continues working on the problem without ever submitting its results. (2) Self-generated task-specific tools hurt on terminal/coding tasks; similar coding tasks were heavily trained in post-training using only universal tools (e.g., grep, bash). (3) Verification logic needs to be richer than what a tool interface naturally supports: the agent often called verify() but performed only superficial checks.
4.4 Skills: The Most Promising Direction (C4)
We examine the influence of self-generated skills during evolution. Skills serve as domain-specific knowledge and guide the agent when it solves similar tasks; they are loaded lazily.
| Experiment | Approach | Score | Skill length |
|---|---|---|---|
| C4a | Skill-only evolution with skill count budgets | 70.58% | 3,012 characters |
| C4b | Skill-only evolution without budgets | 68.23% | 8,122 characters |
Finding: (1) In C4a, after the evolver generated 3 seed skills, the agent chose to read at least one skill on 32.94% of the remaining tasks. (2) Even with more (and longer) skills in C4b, we still see a consistent gain. This indicates that well-targeted skills deliver a measurable improvement, unlike tool, memory, or prompt changes.
4.5 Auto-Harness: The combination of prompt, tool, memory, and skill (C5)
Furthermore, we test combinations of harness components. Experiments C1–C4 show that skills help most and that lazy loading does not hurt performance much. On top of skills, we explore different harness combinations in depth.
| Experiment | Approach | Score | Note |
|---|---|---|---|
| C5a | Skill + Prompt | 67.11% | |
| C5b | Skill + Tool | 66.23% | |
| C5c | Skill + Memory | 67.02% | |
| C5d | Skill + Prompt + Tool | 69.13% | |
| C5e | Skill + Prompt + Memory | 67.29% | |
| C5f | Skill + Tool + Memory | 64.19% | |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |
Finding: Across the C5 series, adding {prompt, tool, memory} on top of skills hurts, except in C5d. For Terminal-Bench 2.0, we should therefore focus on skill evolution to approach the north star established in B1.
5. Skill Evolution Strategies (Group D)
Skills employ a lazy-loading strategy: we list only skill names and one-line descriptions in the system prompt, and the agent calls read_skill(name) to load full content on demand.
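Lazy loading can be sketched as a small registry. The class name and on-disk layout (one SKILL.md-style file per skill, first line as the one-line description) are assumptions; `read_skill` is the tool name from the text:

```python
from pathlib import Path

class SkillLibrary:
    """Skills live as markdown files; only the index enters the prompt."""
    def __init__(self, root: str):
        self.root = Path(root)

    def index_for_prompt(self) -> str:
        """Names + one-line descriptions only, injected into the system prompt."""
        lines = []
        for path in sorted(self.root.glob("*.md")):
            first_line = path.read_text().splitlines()[0]
            lines.append(f"- {path.stem}: {first_line}")
        return "\n".join(lines)

    def read_skill(self, name: str) -> str:
        """Full content is loaded only when the agent explicitly asks."""
        return (self.root / f"{name}.md").read_text()
```

The agent's context thus pays only a few lines per skill up front, and the full 1,500-2,000 characters only for skills it actually reads.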
5.1 Skill Size Optimization (D1)
We examine the impact of skill size:
| Experiment | Approach | Score | Note |
|---|---|---|---|
| D1a | 0~500 chars | 67.21% | |
| D1b | 500~1000 chars | 66.73% | |
| D1c | 1000~1500 chars | 68.91% | |
| D1d | 1500~2000 chars | 71.23% | |
| D1e | 2000~2500 chars | 70.02% | |
| D1f | 2500~3000 chars | 68.43% | |
| D1g | 3000~3500 chars | 67.19% |
Finding: Skill size exhibits a clear sweet spot at 1,500–2,000 characters (D1d, 71.23%). Skills shorter than 1,000 characters lack sufficient domain detail to be useful, while skills exceeding 2,500 characters dilute the signal with redundant content. The drop-off is roughly symmetric: -4.02pp below the sweet spot (D1a) and -4.04pp above it (D1g). This guides our skill creation constraints in subsequent experiments.
5.2 Skill count budget and Per-Task Keyword Selection (D2)
We investigate the number of added skills. We also test a mitigation for the unbounded setting: match skill keywords against the task description and inject only the top-k skills ranked by n-gram overlap relevance.
| Experiment | Skill Delivery | Score | |
|---|---|---|---|
| D2a | with skill count budgets of 3 | 70.58% | |
| D2b | with skill count budgets of 6 | 72.21% | |
| D2c | without skill count budgets | 68.23% | |
| D2d | without skill count budgets but with per-task keyword selection | 71.94% |
Finding: A moderate skill budget of 6 (D2b) achieves the best score at 72.21%, outperforming both the tighter budget of 3 (D2a, 70.58%, with ratio of tasks using skills as 32.94%) and the unbounded setting (D2c, 68.23%). Unbounded skill injection degrades performance because the agent loads irrelevant skills that distract from the task. Per-task keyword selection (D2d, 71.94%) partially mitigates the unbounded problem by filtering skills for relevance, recovering most of the gain without a hard count limit.
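The per-task keyword selection in D2d can be sketched as follows. Scoring by word-bigram Jaccard overlap is our reading of "N-gram overlapping"; the exact formula and function names are assumptions:

```python
# Rank skills by n-gram overlap with the task description, keep top-k (D2d).

def ngrams(text: str, n: int = 2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def select_skills(task_description: str, skills: dict, k: int = 3):
    """skills maps skill name -> keyword string; returns top-k names."""
    task_grams = ngrams(task_description)
    def score(item):
        name, keywords = item
        grams = ngrams(keywords)
        # Jaccard overlap between task bigrams and skill-keyword bigrams.
        return len(task_grams & grams) / max(len(task_grams | grams), 1)
    ranked = sorted(skills.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]
```

This filter lets the library grow unboundedly while each task only sees its few most relevant skills, matching D2d's partial recovery of the D2b gain.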
5.3 Solver-Proposed Skills (D3)
We tested having the solver (with full environment context) propose skill drafts after each task, with the evolver judging the skills and curating proposals.
| Experiment | Skill Generation | Score | Note |
|---|---|---|---|
| D3a | Evolver creates from trajectories | 72.21% | |
| D3b | Solver creates, evolver refines | 67.35% | |
| D3c | Solver creates, evolver generalizes | 69.16% | |
| D3d | Evolver creates and evolver judges and refines | 70.37% |
Finding: “Solver creates, evolver refines” (D3b) has the lowest score. The solver excels at extracting task-specific knowledge but fails at generalization. The evolver merged hyper-specific drafts into existing skills, replacing useful content with narrow tips like “ICS calendar parsing” and “OCR configuration.” “Solver creates, evolver generalizes” (D3c) partially recovered by instructing the evolver to “never shrink” existing skills and to generalize across them. However, the best result still comes from the evolver creating skills directly from trajectories (D3a). Lesson: The bottleneck is not knowledge extraction but knowledge generalization — the solver cannot reliably identify what transfers to unseen tasks.
5.4 Case study: The Process-Advice Trap (D4)
When we analyzed long-horizon trajectories and read the skill content from the D1–D3 experiments, we found that the evolver gravitates toward generic process advice. Without constraints, it produces skills like “always check if packages are installed” and “handle timeouts by trying alternatives.” We introduced a FORBIDDEN/REQUIRED blocklist as follows:
FORBIDDEN: Timeout handling, pip tips, session persistence, generic debugging
REQUIRED: Domain-specific libraries, verification steps, domain pitfalls ONLY
| Experiment | Skill guidance | Score | Note |
|---|---|---|---|
| D4a | Generic skill creation guidance in evolver agent | 72.21% | |
| D4b | + negative prompting | 73.17% | |
| D4c | + positive steering | 70.36% | |
| D4d | + negative prompting + positive steering | 73.21% |
Finding: Negative prompting (D4b, blocking bad outputs) was more effective than positive steering (D4c). The combination (D4d) achieves the highest score but only marginally above negative prompting alone (+0.04pp), suggesting most of the gain comes from the blocklist. Without the blocklist, skills accumulate generic process advice; with it, they stay focused on domain-specific knowledge.
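In our experiments the blocklist is prompt-level guidance to the evolver, but the same gate can be expressed as a programmatic check on candidate skills. A sketch, where the keyword lists paraphrase the FORBIDDEN/REQUIRED rules and substring matching is an assumption:

```python
# Quality gate on candidate skills (D4): block generic process advice,
# require domain-specific markers.

FORBIDDEN = ["timeout handling", "pip", "session persistence",
             "generic debugging"]
REQUIRED_ANY = ["library", "verify", "pitfall"]  # domain-specific markers

def passes_gate(skill_text: str) -> bool:
    text = skill_text.lower()
    if any(bad in text for bad in FORBIDDEN):
        return False                               # negative prompting: hard block
    return any(good in text for good in REQUIRED_ANY)  # positive steering
```

Consistent with D4b vs D4c, the hard block does most of the work; the REQUIRED check alone would still admit generic advice phrased around the right words.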
5.5 Case Study: Skill Evolution Non-Determinism (D5)
Similar to the non-determinism in LLM response generation and agent execution, we examine non-determinism in skill evolution: five evolution runs start from an identical seed. The results are shown below.
| Experiment | Exemplar debugging-skill content (differs per run) | Scientific-computing-skill content (differs per run) | Score |
|---|---|---|---|
| D5a | Protobuf API migration | ML GPU detection | 69.65% |
| D5b | ICS file writing | OCR/tesseract | 71.30% |
| D5c | constraint solvers | KL divergence | 72.21% |
| D5d | C/C++ memory debugging (valgrind) | calendar scheduling | 73.12% |
| D5e | async cancellation | logic circuits | 74.77% |
Finding: Five evolution runs from identical seeds produced substantially different skill content — each evolver wrote a different “textbook” based on which tasks happened to fail in each batch. Despite the content divergence, scores ranged from 69.65% to 74.77%, indicating that multiple distinct skill sets can be effective. This implies that evolved skills are samples from a large space of possibly-useful knowledge, heavily path-dependent on batch ordering and stochastic task outcomes. Ensembling diverse skill sets may be a promising future direction.
6. Self-Verification: Breaking the Ceiling (Group E)
As motivation, our error analysis (Section 3.3) revealed that many flaky-task failures stem from unchecked assumptions rather than missing knowledge. The agent solves the task correctly most of the time but occasionally submits without verifying a critical decision: a wrong normalization, a broken package manager, overfitting to the example.
6.1 Ablation Study: Various Verification Approaches (E1)
We tested five verification delivery mechanisms on the 21 flaky tasks (Opus 4.6, no evolution):
| Experiment | Approach | Score | Change | Avg Turns | Total Tokens |
|---|---|---|---|---|---|
| E1a | Baseline (no verification) | 33.33% | — | 77.8 | 24.34M |
| E1b | Two-step verification-submit workflow | 42.86% | +9.52pp | 78.1 | 25.34M |
| E1c | Optional verify() tool | 47.62% | +14.29pp | 91.0 | 46.46M |
| E1d | Self-verification skill | 66.67% | +33.33pp | 82.0 | 32.29M |
| E1e | Enhanced system prompt | 52.38% | +19.05pp | 69.5 | 18.87M |
| E1f | Skill + prompt combo | 42.86% | +9.52pp | 71.2 | 22.05M |
The self-verification skill (E1d) was the clear winner. Critically, the combination of skill + prompt (E1f) performed worse than either alone — a counterintuitive result with an important explanation.
Why the Combination Hurts
The skill works because it’s lazy-loaded: the model reads it only when uncertain. The prompt instruction forces verification on every task, including those where the model is already confident. When both are present:
- The prompt forces a verification step the model wouldn’t naturally take
- The model also loads the skill (because it’s now in verification mode)
- Redundant double-verification wastes 2-3 turns per task
- The aggregate overhead costs more tasks than extra checking saves
Lesson: Let the model self-select when to verify rather than forcing it universally.
6.2 Scaling to Full Benchmark: The Initial Disappointment (E2)
The ablation showed the gain from the self-verification skill on the 21 flaky tasks. We then ran the same configuration on all 89 tasks with various self-verification skills.
| Experiment | Skill setup | Score | Note |
|---|---|---|---|
| E2a | No predefined seed skills | 72.21% | |
| E2b | Generic self-verification skill as seed skill | 70.16% | |
| E2c | Curated self-verification skill as seed skill | 74.27% |
Finding: Without any predefined skills (E2a), the baseline already achieves 72.21%. Adding a generic self-verification skill (E2b) actually hurts by 2.05pp — the skill contained advice like “check edge cases,” “test empty input,” “verify permissions,” which is knowledge the model already has (the same process-advice trap from Section 5.4). Only when we curated the skill with targeted assumption-checking content (E2c) did verification deliver a +2.06pp gain. This confirms that what the skill says matters far more than that a skill exists.
Guided by failure pattern analysis (Section 3.3), we replaced the generic advice with targeted assumption-checking:
Before (generic):
## 3. Edge Cases
- Check with empty input, special characters, large input
- Verify error handling works
- Check permissions (chmod, ownership)
After (targeted):
## 3. Check Your Assumptions
- If you chose between multiple approaches (e.g., normalization methods),
verify your choice matches what the task/test expects
- If you installed or modified system packages, verify the package manager
still works
- If your solution works on the provided example, consider whether it
generalizes to different inputs
Each bullet addresses a real failure mode:
- Normalization choice: pytorch-model-cli failure (chose MNIST mean/std instead of raw /255)
- Package manager state: dna-assembly failure (apt-get timeout corrupted dpkg)
- Generalization: video-processing failure (MOG2 overfit to example video)
We also enhanced the server verification: “curl or wget to test endpoints — verify response content, not just that the port is open. Checking ports with ss or netstat is NOT enough.” This addresses install-windows-3.11 where the agent verified port 80 was open but never tested nginx content.
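The "verify content, not just the port" distinction can be demonstrated concretely. In this self-contained sketch (helper names are ours), a stub server stands in for a misconfigured nginx: the shallow port check passes while the content check correctly fails:

```python
import http.server, socket, threading, urllib.request

class StubHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"default backend page")  # wrong content on purpose
    def log_message(self, *args):                  # silence request logging
        pass

def port_is_open(host: str, port: int) -> bool:
    """Shallow check, equivalent to looking at ss/netstat output."""
    with socket.socket() as s:
        s.settimeout(2)
        return s.connect_ex((host, port)) == 0

def serves_expected(url: str, expected: str) -> bool:
    """Deep check: fetch the page and inspect the actual response body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return expected in resp.read().decode()

server = http.server.HTTPServer(("127.0.0.1", 0), StubHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

shallow_ok = port_is_open("127.0.0.1", port)
deep_ok = serves_expected(f"http://{'127.0.0.1'}:{port}/", "Welcome to nginx")
server.shutdown()
```

Here `shallow_ok` is true while `deep_ok` is false, which is exactly the failure mode the enhanced verification guidance targets.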
6.3 The Nudge: Directing the Model to Read the Skill (E3)
With improved content, we added a single line to the system prompt: “After you think you have completed the task, read the self-verification skill to verify your solution.”
| Experiment | Nudge | Skill Content | Skill Loads | Score |
|---|---|---|---|---|
| E3a | None | targeted | 24.71% | 74.27% |
| E3b | Verification nudge | generic | 77.65% | 68.24% |
| E3c | Verification nudge | targeted | 90.59% | 76.47% |
Finding: The system prompt nudge dramatically increases skill loading frequency — from 24.71% (E3a) to 77.65%–90.59% (E3b/E3c). However, loading frequency alone does not guarantee improvement: E3b loads the skill on 77.65% of tasks yet drops performance by 6.03pp because the generic content wastes turns on checks the model already performs. E3c, with targeted content and 90.59% loading, achieves the best result at 76.47%. Lesson: Content quality matters more than loading frequency. The improvement comes from making each verification round productive (catches real bugs) rather than wasteful (checks things the model already knows).
7. Multi-Agent System Evolver (Group F)
Our single-agent evolver (used throughout Groups C–E) handles analysis, skill authoring, and quality control within one context window. This creates path dependence (D5) and forces a single agent to context-switch between fundamentally different cognitive tasks. We decomposed skill evolution into four specialized LLM agents — Orchestrator, Analyst, Author, and Critic — called MAS-Evolver.
MAS-Evolver uses an orchestrator-dispatch pattern on the Strands framework:
| Agent | Role | Tools | Output |
|---|---|---|---|
| Orchestrator | Coordinate evolution cycle | analyze, author_skill, critique_skill, workspace_bash | Skills written to workspace |
| Analyst | Identify failure patterns from compressed trajectories | None (pure reasoning) | JSON failure patterns |
| Author | Create candidate skills (1,500–2,000 chars) | None (pure reasoning) | SKILL.md candidate |
| Critic | Adversarial review | None (pure reasoning) | Verdict: APPROVE / REVISE / REJECT |
Key design decisions: (1) subagents are pure-reasoning with no tools — only the orchestrator mutates the workspace; (2) strict information boundaries prevent bias: the Analyst never sees existing skills, the Author never sees raw trajectories, and the Critic never sees the Author’s reasoning.
7.1 MAS Architecture Comparison (F1)
We compared three multi-agent coordination patterns against vanilla. The orchestrator pattern uses a central coordinator that dynamically sequences subagent calls with an iterative author–critic revision loop (up to 2 rounds per skill). The DAG pattern arranges agents in a fixed directed acyclic graph (Analyst → Author → Critic) with no revision cycles. The swarm pattern allows agents to self-organize without centralized control.
| Experiment | System | Accuracy | Note |
|---|---|---|---|
| F1a | Vanilla Opus 4.6 (= A2) | 65.91% | — |
| F1b | Orchestrator MAS design | 69.28% | Dynamic sequencing + author–critic loop |
| F1c | DAG MAS design | 67.31% | Fixed pipeline, no revision cycles |
| F1d | Agent Swarm MAS design | 68.45% | Decentralized, no centralized control |
Finding: The orchestrator pattern (F1b, 69.28%) outperforms both DAG (F1c, -1.97pp) and swarm (F1d, -0.83pp). The DAG's fixed pipeline cannot revise rejected skills: once the Critic rejects, the skill is lost. The swarm lacks centralized quality control, producing more skills but with inconsistent quality. The orchestrator's iterative author–critic loop is the key differentiator: in Cycle 1, MAS-Evolver produced new skills (e.g., numerical-algorithm-implementation, covering log-space arithmetic in R, NaN/Inf guards, and adaptive rejection sampling), and the Critic approved after one revision round. Cycles 2–4 repeated the process, creating additional skills until the budget was met.
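The orchestrator's author–critic revision loop can be sketched as below. The subagents are stubbed as plain functions (in the real system they are Strands LLM agents), and the control flow is our reading of the F1b design:

```python
# Orchestrator-side skill evolution (F1b): Analyst -> Author -> Critic,
# with up to 2 revision rounds per skill. Information boundaries: the
# Analyst never sees existing skills, the Author never sees raw
# trajectories, the Critic never sees the Author's reasoning.

def evolve_skill(analyst, author, critic, trajectories, max_revisions=2):
    pattern = analyst(trajectories)           # JSON-like failure pattern
    draft = author(pattern)                   # SKILL.md candidate
    for _ in range(max_revisions):
        verdict, feedback = critic(draft)     # APPROVE / REVISE / REJECT
        if verdict == "APPROVE":
            return draft                      # skill written to workspace
        if verdict == "REJECT":
            return None                       # quality gate: drop the skill
        draft = author((pattern, feedback))   # REVISE: one more round
    return None                               # revision budget exhausted
```

The DAG variant corresponds to `max_revisions=0` plus treating REVISE as REJECT, which is why rejected skills are simply lost there.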
7.2 Context Management in MAS (F2)
With the orchestrator architecture fixed from F1, we investigated how trajectory compression affects skill quality, since context management and agent communication are key in a MAS. Each evolution cycle processes batch trajectories that can be compressed for the subagents. We compared three strategies: adaptive compression, which adjusts granularity based on content complexity and the remaining budget (e.g., the model's context window limit); whole-trajectory compression, which keeps the task description intact and uniformly compresses the entire trajectory; and sliding-window compression, which retains the task description plus a fixed recent window and compresses everything earlier.
| Experiment | System | Accuracy | Note |
|---|---|---|---|
| F2a | Raw trajectory with adaptive compression based on content and budget | 72.94% | ~2K chars/task |
| F2b | Kept task description + whole trajectory compression | 70.31% | |
| F2c | Kept task description + sliding window compression | 68.91% |
Finding: Adaptive compression (F2a, 72.94%) outperforms both whole-trajectory (F2b, -2.63pp) and sliding-window (F2c, -4.03pp) compression. Whole-trajectory compression discards too much detail uniformly — critical failure signals (e.g., a wrong normalization choice at turn 12) get compressed away alongside irrelevant boilerplate. Sliding-window compression is worse because it systematically drops early-trajectory context where task understanding and initial approach decisions occur, which are precisely the signals the Analyst needs. Adaptive compression preserves high-information-density segments (errors, decision points, tool output anomalies) while compressing routine segments (package installs, file reads) when the context window limit is reached. Overall, MAS-Evolver with adaptive compression achieves +7.03pp over vanilla, comparable to the single-agent skill evolution baseline (D2b: 72.21%).
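The adaptive strategy can be sketched as a density-aware filter. The high-signal markers and the truncation scheme below are illustrative assumptions; the point is that error/decision segments survive verbatim while routine segments are stubbed:

```python
# Adaptive trajectory compression (F2a sketch): keep high-information
# segments verbatim, stub routine ones, cap total size as a last resort.

HIGH_SIGNAL = ("error", "traceback", "failed", "assert", "warning")

def compress(trajectory, budget_chars=2000, stub_len=80):
    """trajectory: list of text segments (tool calls + outputs)."""
    kept = []
    for seg in trajectory:
        if any(marker in seg.lower() for marker in HIGH_SIGNAL):
            kept.append(seg)                 # preserve errors/decision points
        else:
            kept.append(seg[:stub_len] + ("..." if len(seg) > stub_len else ""))
    out = "\n".join(kept)
    return out[:budget_chars]                # hard cap on the ~2K chars/task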
8. Results Summary
8.1 Full Score Progression
Below we present all experiments organized by group. Detailed analysis for each experiment is in its respective section.
Group A: Baselines (Section 2)
| Experiment | Description | Score |
|---|---|---|
| A1 | Opus 4.5 vanilla ReAct agent | 60.15% |
| A2 | Opus 4.6 vanilla ReAct agent | 65.91% |
Group B: Ceiling Analysis (Section 3)
| Experiment | Description | Score |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs | 85.88% |
| B1b | 2× per-task timeout | 68.12% |
| B1c | Environment signal (artifact vs ground-truth) | 66.23% |
Group C: Harness Component Exploration (Section 4)
| Experiment | Description | Score |
|---|---|---|
| C1a | Prompt-only evolution (lightweight) | 66.02% |
| C1b | Prompt-only evolution (tailored from B3) | 60.13% |
| C2a | Memory-only evolution (with budgets) | 65.11% |
| C2b | Memory-only evolution (unbounded) | 60.46% |
| C3a | Bash + Python (no Submit tool) | 58.12% |
| C3b | Bash + Python + Submit (= A2) | 65.91% |
| C3c | + Self-generated tools | 59.37% |
| C4a | Skill-only evolution (budgeted) | 70.58% |
| C4b | Skill-only evolution (unbounded) | 68.23% |
| C5a | Skill + Prompt | 67.11% |
| C5b | Skill + Tool | 66.23% |
| C5c | Skill + Memory | 67.02% |
| C5d | Skill + Prompt + Tool | 69.13% |
| C5e | Skill + Prompt + Memory | 67.29% |
| C5f | Skill + Tool + Memory | 64.19% |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |
Group D: Skill Evolution Strategies (Section 5)
| Experiment | Description | Score |
|---|---|---|
| D1a | Skill size 0–500 chars | 67.21% |
| D1b | Skill size 500–1,000 chars | 66.73% |
| D1c | Skill size 1,000–1,500 chars | 68.91% |
| D1d | Skill size 1,500–2,000 chars | 71.23% |
| D1e | Skill size 2,000–2,500 chars | 70.02% |
| D1f | Skill size 2,500–3,000 chars | 68.43% |
| D1g | Skill size 3,000–3,500 chars | 67.19% |
| D2a | Skill count budget = 3 | 70.58% |
| D2b | Skill count budget = 6 | 72.21% |
| D2c | No skill count budget | 68.23% |
| D2d | No budget + per-task keyword selection | 71.94% |
| D3a | Evolver creates from trajectories | 72.21% |
| D3b | Solver creates, evolver refines | 67.35% |
| D3c | Solver creates, evolver generalizes | 69.16% |
| D3d | Evolver creates and judges/refines | 70.37% |
| D4a | Generic skill creation guidance | 72.21% |
| D4b | + FORBIDDEN blocklist (negative prompting) | 73.17% |
| D4c | + positive steering | 70.36% |
| D4d | + negative prompting + positive steering | 73.21% |
| D5a–D5e | 5 runs from identical seed (skill variance) | 69.65%–74.77% |
Group E: Self-Verification (Section 6)
| Experiment | Description | Score | Note |
|---|---|---|---|
| E1a | Baseline — no verification (21 flaky tasks) | 33.33% | Flaky subset only |
| E1b | Two-step verification-submit workflow | 42.86% | Flaky subset only |
| E1c | Optional verify() tool | 47.62% | Flaky subset only |
| E1d | Self-verification skill | 66.67% | Flaky subset only |
| E1e | Enhanced system prompt | 52.38% | Flaky subset only |
| E1f | Skill + prompt combo | 42.86% | Flaky subset only |
| E2a | No predefined seed skills (full benchmark) | 72.21% | |
| E2b | Generic self-verification skill | 70.16% | |
| E2c | Curated self-verification skill | 74.27% | |
| E3a | No nudge, targeted skill | 74.27% | 24.71% skill load rate |
| E3b | Nudge + generic skill | 68.24% | 77.65% skill load rate |
| E3c | Nudge + targeted skill | 76.47% | 90.59% skill load rate |
Group F: Multi-Agent System Evolver (Section 7)
| Experiment | Description | Score | Note |
|---|---|---|---|
| F1a | Vanilla Opus 4.6 (= A2) | 65.91% | — |
| F1b | Orchestrator MAS design | 69.28% | Dynamic sequencing + author–critic loop |
| F1c | DAG MAS design | 67.31% | Fixed pipeline, no revision cycles |
| F1d | Agent Swarm MAS design | 68.45% | Decentralized, no centralized control |
| F2a | Adaptive compression | 72.94% | ~2K chars/task |
| F2b | Whole trajectory compression | 70.31% | |
| F2c | Sliding window compression | 68.91% | |
Key Progression Summary
The path from baseline to best result:
| Stage | Key Innovation | Score | Δ from A2 |
|---|---|---|---|
| A1 | Opus 4.5 baseline | 60.15% | — |
| A2 | Opus 4.6 baseline | 65.91% | — |
| C4a | Skill-only evolution (budgeted) | 70.58% | +4.67pp |
| D2b | Skill count budget = 6 | 72.21% | +6.30pp |
| F2a | MAS-Evolver (multi-agent skill evolution) | 72.94% | +7.03pp |
| D4d | FORBIDDEN blocklist | 73.21% | +7.30pp |
| E2c | Curated verification skill | 74.27% | +8.36pp |
| E3c | Targeted verification + nudge | 76.47% | +10.56pp |
8.2 Leaderboard
| System | Score | Notes |
|---|---|---|
| Opus 4.6 (A2) | 65.91% | |
| MAS-Evolver (F2a) | 72.94% | |
| Copy (Top #7) | 75.3% | |
| A-Evolve (E3c) | 76.47% | |
| ForgeCode (Top #1) | 81.8% | |
9. Key Learnings
What Works
Lazy skill loading. Let the model decide which skills to read. Without nudging, the model self-selects on ~25–33% of tasks (E3a, C4a); with a targeted nudge, up to 91% (E3c). System prompt stays at ~1,900 characters instead of 15,000+.
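A minimal sketch of this lazy-loading pattern, assuming a simple in-memory skill store; the function names, tool interface, and skill contents are illustrative, not the paper's actual harness:

```python
def skill_index(skills: dict[str, str]) -> str:
    """Build the short index that goes into the system prompt:
    one name plus first line per skill, never the full bodies,
    which keeps the prompt small instead of embedding everything."""
    lines = [f"- {name}: {body.splitlines()[0]}"
             for name, body in sorted(skills.items())]
    return ("Available skills (call load_skill(<name>) to read one in full):\n"
            + "\n".join(lines))

def load_skill(skills: dict[str, str], name: str) -> str:
    """Tool exposed to the solver; returns the full skill body on demand."""
    return skills.get(name, f"No skill named {name!r}.")

# Illustrative skill store (names and bodies are made up).
SKILLS = {
    "self-verification": "Check your assumptions before submitting.\nFull body here...",
    "git-history": "Sanitize git history safely.\nFull body here...",
}
```

The model sees only the one-line index and decides per task whether a full skill is worth a tool call.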
Targeted self-verification. A “Check Your Assumptions” framing in the self-verification skill directs the solver to verify its decision choices (normalization method, algorithm selection, package state) rather than generic edge cases. Content quality drives the improvement: generic verification hurts (E2b, -2.05pp) while targeted verification helps (E3c, +10.56pp).
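A sketch of how a per-task nudge might steer the solver toward loading the verification skill before submitting; the nudge text and injection point are assumptions, not the exact wording used in E3c:

```python
# Hypothetical nudge appended to the task message; in E3c a targeted
# nudge like this raised the skill load rate from ~25% to ~91%.
NUDGE = (
    "Before calling Submit, load the 'self-verification' skill and "
    "re-check the decision choices you made (normalization method, "
    "algorithm selection, package state) against the task description."
)

def build_user_message(task_description: str, nudge: bool = True) -> str:
    """Compose the solver's task message, optionally with the nudge."""
    return f"{task_description}\n\n{NUDGE}" if nudge else task_description
```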
FORBIDDEN blocklist. Blocking generic process advice forces domain-specific skill creation (D4b, +0.96pp over unblocked D4a). Negative prompting outperforms positive steering (D4b vs D4c).
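In the paper the blocklist is negative prompting in the evolver's skill-creation guidance; as a complementary hard gate, candidate skills could also be checked mechanically. A sketch, where the phrase list and function name are illustrative assumptions:

```python
# Hypothetical blocklist of generic process advice that a new skill
# must not lean on; real entries would come from observed failure modes.
FORBIDDEN = [
    "read the task carefully",
    "break the problem into steps",
    "test your changes",
    "think step by step",
]

def passes_blocklist(skill_text: str) -> bool:
    """Reject candidate skills that contain generic advice instead of
    domain-specific knowledge (the blocklist enforced in code)."""
    lowered = skill_text.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN)
```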
Minimal prompt. The 1,368-character base prompt is near-optimal (C1b shows longer prompts hurt). Total budget including nudges: ~1,485 characters. Every addition must be weighed against attention dilution.
What Doesn’t Work
Solver-proposed skills. The solver extracts task-specific knowledge but fails at generalization (D3b: -4.86pp vs D3a).
Meta-skills. A “systematic-exploration” skill was loaded on only 1.18% of tasks. The model does not self-assess when it is stuck.
Unbounded skill injection. Removing skill count budgets degrades performance (D2c: 68.23% vs D2b: 72.21%) even with lazy loading, because the agent loads irrelevant skills.
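One way to enforce such a budget is to keep only the top-k skills by a usefulness score. This sketch assumes load-and-pass counts as the signal, which is our guess at a reasonable heuristic rather than the paper's actual pruning rule:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    body: str
    loads: int = 0   # how often the solver chose to load this skill
    wins: int = 0    # loads that ended in a passing task

def enforce_budget(skills: list[Skill], budget: int = 6) -> list[Skill]:
    """Keep at most `budget` skills (D2b found 6 works well),
    preferring skills that were loaded often and associated with success."""
    def score(s: Skill) -> float:
        return (s.wins + 1) / (s.loads + 2)  # smoothed win rate
    return sorted(skills, key=score, reverse=True)[:budget]
```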
Design Insights
Error analysis drives improvement. The +9.41pp gain came from trajectory analysis → failure pattern identification → targeted skill content.
Simple scaffold wins. Best configuration: minimal prompt (1.5K) + 5 focused skills (8.5K) + lazy loading + no evolution.
Multi-agent decomposition validates but doesn’t yet surpass. MAS-Evolver (72.94%) matches the single-agent baseline (D2b: 72.21%) while producing adversarially reviewed skills. The quality gate works — skills pass Critic review — but the architecture has not yet incorporated the refinements (blocklist, targeted verification) that drive the single-agent system to 76.47%.
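The author-critic quality gate described above can be sketched as a bounded revision loop; the agent call signatures are placeholders, not the actual MAS-Evolver interfaces:

```python
from typing import Callable, Optional, Tuple

def author_critic_loop(
    author: Callable[[str, str], str],          # (task_summary, feedback) -> skill draft
    critic: Callable[[str], Tuple[bool, str]],  # draft -> (approved, feedback)
    task_summary: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Iterate author drafts through adversarial critic review;
    return the first approved skill, or None if the gate never passes."""
    feedback = ""
    for _ in range(max_rounds):
        draft = author(task_summary, feedback)
        approved, feedback = critic(draft)
        if approved:
            return draft
    return None
```

Only skills that survive the critic enter the skill pool, which is the quality gate that keeps generic or task-overfit drafts out.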
10. Future Directions
A-Evolve+: Richer signal beyond in-situ evolution. The trajectory-only evolver sees compressed logs without pass/fail outcomes. Richer feedback (e.g., the pass/fail result or other verifiable signals) could improve skill quality.
