Designing Self-evolving Coding Agents for Terminal-Bench 2.0: A Systematic Exploration
Abstract. We present a systematic exploration of designing self-evolving coding agents for Terminal-Bench 2.0, a benchmark of 89 diverse command-line tasks. Starting from a vanilla ReAct agent powered by Claude Opus 4.6 (65.91%), we conduct large-scale controlled experiments across six groups: establishing baselines, analyzing the performance ceiling (pass@5: 85.88%), exploring four harness components (prompt, memory, tools, skills), refining skill evolution strategies, developing self-verification mechanisms, and decomposing evolution into a multi-agent system (MAS-Evolver). Our investigation reveals that among all mutable components, skill evolution delivers the most consistent improvement, and a nudged self-verification mechanism further boosts performance. MAS-Evolver validates that separating analysis, authoring, and adversarial review into specialized agents produces quality-gated skills (+7.03pp over vanilla). Altogether, our best self-evolution algorithm achieves 76.47%, a +10.56pp improvement over the Opus 4.6 vanilla baseline, ranking #7 on the Terminal-Bench 2.0 leaderboard.
1. Introduction
1.1 Terminal-Bench 2.0
Terminal-Bench 2.0 (TB2) is a benchmark of 89 diverse command-line tasks spanning system administration, scientific computing, security, data science, and software engineering (e.g., configuring web servers, analyzing Bayesian networks, training ML models, sanitizing git history, and compiling software from source). Each task provides a Docker container with a pre-configured environment, a natural language task description, and a hidden test script that validates the solution. Task difficulty ranges widely, from 5-minute file edits to 60-minute multi-step system configurations, with per-task timeouts.
1.2 Agent Architecture
Our solver agent (i.e., the coding agent) employs two agent frameworks: (1) Strands Agents + AgentCore; (2) a standard ReAct loop + Sandbox/Docker. The solver has access to three core tools:
- bash(): Execute shell commands in the Docker container (60s timeout per call)
- python(): Execute Python code in an isolated interpreter
- submit("DONE"): Signal task completion
The system prompt follows an Analysis/Plan format that structures the agent’s reasoning before each tool call.
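The three core tools can be sketched as follows. This is a minimal illustration assuming subprocess-based execution; the helper bodies are ours, not the harness's actual implementation:

```python
import subprocess

def bash(command: str, timeout: int = 60) -> str:
    """Run a shell command in the task container (60s timeout per call)."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return f"[timeout after {timeout}s]"

def python(code: str) -> str:
    """Execute Python code in a separate, isolated interpreter process."""
    return bash(f"python3 -c {code!r}", timeout=60)

def submit(answer: str = "DONE") -> str:
    """Signal task completion to the harness."""
    return f"[submitted: {answer}]"
```

In the real agent these are exposed to the model as tool schemas; here they are plain functions for clarity.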
1.3 The Evolution Framework (A-Evolve)
A-Evolve deploys a trajectory-only, in-situ evolution loop as follows:
Solver Agent (bash/python/submit tools)
↓ solves tasks in Docker containers
Trajectory Observer (JSONL logs)
↓ records tool calls, outputs, ...
LLM Judge (scores 0-10 without ground-truth)
↓ evaluates trajectories/identify potential failure patterns/...
Evolver Agent (LLM agent accessing bash tool to modify the agent harness)
↓ mutates prompts/skills/tools/memories based on observations
Skill Library (lazy-loading)/Prompt/Tool Registry/Memory
↓ agent reads skills on demand/loads updated prompts/...
Next Batch's Solver
The key design constraint is that evolution operates without ground-truth feedback. The evolver never sees test results — it must infer what went wrong from the trajectory alone. This makes the problem significantly harder than test-guided optimization.
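The cycle above can be summarized in a short skeleton. The function names and data shapes are illustrative, not the actual harness API; the essential property is that no ground-truth test result ever reaches the evolver:

```python
# One A-Evolve cycle: solve -> observe -> judge -> evolve (no ground truth).

def run_evolution_cycle(tasks, harness, solve, judge, evolve):
    trajectories = []
    for task in tasks:
        # Solver works in the task container; the observer logs every step
        # as a list of (tool_call, output) records.
        trajectories.append(solve(task, harness))

    # LLM judge scores 0-10 and flags suspected failure patterns using
    # only the trajectory, never the hidden test script.
    reviews = [judge(t) for t in trajectories]

    # Evolver mutates the mutable harness parts (skills/prompt/tools/memory)
    # for the next batch's solver.
    return evolve(harness, trajectories, reviews)
```

Usage amounts to calling `run_evolution_cycle` once per batch and feeding the returned harness into the next batch's solver.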
1.4 Experiment Metrics
For the solver agent, we measure: (1) pass rate (resolved/total, %); (2) average turns; (3) average tool calls; (4) average number of skills used; (5) running time. For the evolver agent, we measure: (1) the number of created skills/tools/memories; (2) the number of characters changed in the prompt/memory.
1.5 Experiment Notation
We conduct our experiments in a principled way to systematically build a self-evolving coding agent, organized into the following chronologically thematic groups. Each group is labeled with a letter prefix, and individual experiments within a group are numbered sequentially:
| Group | Focus | Experiments | Section |
|---|---|---|---|
| A | Baseline establishment | A1–A2 | Section 2 |
| B | Ceiling analysis and task classification | B1a–B1c, B2, B3 | Section 3 |
| C | Harness component exploration (prompt, memory, tools, skills, combinations) | C1a–C1b, C2a–C2b, C3a–C3c, C4a–C4b, C5a–C5g | Section 4 |
| D | Focused skill evolution strategies (size, count, generation, filtering, variance) | D1a–D1g, D2a–D2d, D3a–D3d, D4a–D4d, D5a–D5e | Section 5 |
| E | Self-verification test-time scaling | E1a–E1f, E2a–E2c, E3a–E3c | Section 6 |
| F | Multi-Agent System (MAS) Evolver | F1a–F1d, F2a–F2c | Section 7 |
1.6 Infrastructure: API reliability and context management
Our agent infrastructure addresses two key challenges: API reliability and context management. For API reliability, running 20 concurrent AWS Bedrock API calls requires robust error handling — we implement transient error retries (covering read timeouts, content filtering false positives, and connection hangs), timeout clock pausing during retry waits so that idle retry time does not consume the task’s budget, and a wall-clock safety cap to prevent infinite retry loops on persistent outages. For context management, we employ a summarization-based approach: when the conversation reaches the context window limit, we summarize the preceding conversation history into a condensed representation and continue from that summary, preserving critical state while freeing context capacity for continued reasoning.
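The retry-with-budget-pausing logic can be sketched as below. The class and function names are ours, and the backoff schedule is illustrative; the key ideas from the text are that retry waits do not consume the task's time budget, and a wall-clock cap bounds total retrying:

```python
import time

class PausableBudget:
    """Task-time budget that can be paused while waiting on API retries."""
    def __init__(self, budget_s: float):
        self.remaining = budget_s
        self._started = None
    def start(self):
        self._started = time.monotonic()
    def pause(self):
        # Charge only the time spent actually working, not retry waits.
        self.remaining -= time.monotonic() - self._started
        self._started = None

def call_with_retry(fn, budget, max_wall_s=600, waits=(1, 2, 4, 8)):
    deadline = time.monotonic() + max_wall_s      # hard cap on total retrying
    for wait in waits:
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if time.monotonic() + wait > deadline:
                raise                             # persistent outage: give up
            budget.pause()                        # retry wait is "free" time
            time.sleep(wait)
            budget.start()
    return fn()                                   # final attempt, no catch
```

The caller starts the budget before invoking the model and the wrapper pauses it only while backing off, so transient API flakiness does not eat into a task's timeout.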
2. Establishing the Baseline (Group A)
2.1 Opus 4.5 Baseline (A1)
Our initial experiments used Claude Opus 4.5. The vanilla ReAct coding agent scored an average of 60.15% across 5 runs. (By default, all results are averaged over 5 runs; some experiments use 3 runs due to API quota limits.)
| Metric | A1: Opus 4.5 |
|---|---|
| Score | AVG 60.15% |
| Mean turns (all tasks) | 72.84 |
| Mean input tokens/task | 1,081,840 |
| Mean output tokens/task | 15,689 |
| Prompt length | 1,368 chars |
| Skills loaded | 0 (no loading mechanism) |
2.2 Opus 4.6 Baseline (A2)
When Claude Opus 4.6 was released (February 5, 2026), we used the same ReAct workflow and observed a significant capability jump:
| Metric | A1: Opus 4.5 | A2: Opus 4.6 | Change |
|---|---|---|---|
| Score | AVG 60.15% | AVG 65.91% | +5.76pp |
| Mean turns (all) | 72.84 | 60.68 | -16.70% |
| Mean input tokens/task | 1,081,840 | 670,784 | -38.01% |
| Mean output tokens/task | 15,689 | 13,910 | -11.34% |
| Prompt length | 1,368 chars | 1,368 chars | Same |
| Skills loaded | 0 | 0 | Same |
Finding: Opus 4.6 solves tasks with 38.01% fewer input tokens and 16.70% fewer turns while achieving a higher pass rate; the model is substantially more efficient and capable. In particular, we find that Opus 4.6 often self-verifies its results without being prompted.
3. Ceiling Analysis and Task Classification (Group B)
Before attempting to improve performance, we needed to (1) estimate the performance upper bound, by computing pass@5 from A2 (B1a), increasing the per-task time limit (B1b), and providing an environment signal in the ReAct loop that compares the generated artifact with the ground-truth result (B1c); and (2) understand the benchmark's structure: which tasks are deterministically solvable, which are beyond the model's capability, and which are stochastic. We then focus on the stochastic, flaky tasks for improvement.
3.1 Ceiling Analysis (B1)
| Experiment | Design details | Performance |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs (upper bound of flaky stabilization) | 85.88% |
| B1b | 2× per-task timeout (1,800s–7,200s) | 68.12% |
| B1c | Environment signal: compare generated artifact against ground-truth output in the ReAct loop | 66.23% |
Finding: B1a establishes the north star for Terminal-Bench 2.0 (85.88%) that guides our exploration. B1c shows that environment signals do not help much on these coding tasks. B1b shows that doubling the time limit helps by ~2pp, indicating that TB2 is a resource-heavy benchmark.
3.2 Task Stability Classification (B2)
Across 5 independent Opus 4.6 runs, we classified all 89 tasks by their pass rate:
| Category | Count | Pass Rate | Description |
|---|---|---|---|
| Always pass | 52 | 100.00% | Deterministic — model capability sufficient |
| Flaky (unstable) | 11 | 40–80% (pass 2–4 of 5) | Likely solvable with small improvements |
| Flaky (rare pass) | 10 | 20% (pass 1 of 5) | Hard but occasionally solvable |
| Always fail | 16 | 0.00% | Beyond current capability |
Finding for the theoretical ceiling: If all flaky tasks could be stabilized, the score would reach 85.88%. The 16 always-fail tasks represent a hard capability boundary.
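The classification and ceiling computation are simple bookkeeping over per-task pass counts. A sketch with toy data (real runs cover 89 tasks; function names are ours):

```python
# Classify tasks by pass count across 5 runs (B2) and derive the
# flaky-stabilized ceiling, i.e., pass@5 (B1a).

def classify(pass_counts, runs=5):
    buckets = {"always_pass": 0, "flaky_unstable": 0,
               "flaky_rare": 0, "always_fail": 0}
    for n in pass_counts.values():
        if n == runs:
            buckets["always_pass"] += 1       # deterministic successes
        elif 2 <= n < runs:
            buckets["flaky_unstable"] += 1    # pass 2-4 of 5
        elif n == 1:
            buckets["flaky_rare"] += 1        # pass 1 of 5
        else:
            buckets["always_fail"] += 1       # beyond current capability
    return buckets

def pass_at_k(pass_counts):
    """Fraction of tasks solved in at least one run (the theoretical ceiling)."""
    return sum(n > 0 for n in pass_counts.values()) / len(pass_counts)

counts = {"t1": 5, "t2": 3, "t3": 1, "t4": 0}
```

On the toy data each bucket gets one task and the ceiling is 0.75; on the real 89-task data this computation yields the 85.88% figure.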
3.3 Failure Pattern Taxonomy (B3)
Deep trajectory analysis, comparing successful and failed runs of the same task side by side, revealed four distinct failure patterns:
| Pattern | Frequency | Description | Challenge |
|---|---|---|---|
| Premature convergence | High | Model locks into a wrong approach early and never backtracks | Moderate |
| Semantic reasoning error | Medium | Model misreads task requirements | Moderate |
| Hidden eval constraints | Medium | Task has implicit requirements not in the description | Low |
| Algorithm stochasticity | High | Equally plausible approaches differ in robustness | High |
These failure patterns help guide our next-stage experiment design for harness improvement (i.e., workflow, prompt, memory, tool, and skill).
4. Exploring Improvement Directions (Group C)
With the baseline, north star, and flaky tasks identified, we systematically explored four workspace components: tools, memory, system prompt, and skills.
4.1 System Prompt: Behavioral Strategy Evolution (C1)
We restricted the evolver to modifying only the system prompt to address the failure cases in B3, e.g., adding behavioral strategy rules like "Comprehensively understand the task and think in various directions before starting to solve the problem."
| Experiment | Approach | Score | Prompt Length |
|---|---|---|---|
| C1a | Prompt-only evolution with lightweight guidance | 66.02% | 1,368 -> 1,940 chars |
| C1b | Prompt-only evolution with tailored guidance from B3 | 60.13% | 1,368 -> 4,940 chars |
Finding: C1a scored essentially the same as vanilla A2, while the longer, more tailored prompt hurt in C1b. The strategy rules are qualitatively sensible but quantitatively inert: Opus 4.6 already implements these strategies implicitly.
4.2 Memory: Episodic Learning Across Tasks (C2)
We tested memory injection: episodic observations recorded from previous tasks and fully loaded into the context for future tasks. Memories accumulated across evolution cycles without providing task-relevant information.
| Experiment | Approach | Score | Memory Length |
|---|---|---|---|
| C2a | Memory-only evolution with memory budgets | 65.11% | 1,431 chars |
| C2b | Memory-only evolution without limitations | 60.46% | 2,770 chars |
Finding: Memory injection was consistently neutral or slightly negative. The evolver stored generic process observations (“always submit your answer,” “save intermediate results”) — advice the model already knows.
4.3 Tools: Universal vs Task-Specific (C3)
We tested the impact of adding pre-defined tools and self-generated tools. The pre-defined tool set is Bash, Python, and Submit. During self-evolution, the evolver agent generates task-specific tools from task trajectories.
| Experiment | Approach | Score | Notes |
|---|---|---|---|
| C3a | Bash + Python | 58.12% | |
| C3b | Bash + Python + Submit | 65.91% | the same setting as A2 |
| C3c | Bash + Python + Submit + Self-generated tools (e.g., SecurityFileCheck(), Verify()) | 59.37% |
Finding: (1) The Submit tool helps because the agent otherwise does not know when to stop: in C3a trajectories, the agent sometimes continues working on the problem without ever submitting its results. (2) Self-generated task-specific tools hurt on terminal/coding tasks; similar coding tasks were heavily trained in post-training using only universal tools (e.g., grep, bash). (3) Verification logic needs to be richer than what a tool interface naturally supports: the agent often called verify() but performed only superficial checks.
4.4 Skills: The Most Promising Direction (C4)
We examine the influence of self-generated skills during evolution. Skills serve as domain-specific knowledge and guide the agent when it solves similar tasks; they are loaded lazily.
| Experiment | Approach | Score | Skill length |
|---|---|---|---|
| C4a | Skill-only evolution with skill count budgets | 70.58% | 3,012 characters |
| C4b | Skill-only evolution without budgets | 68.23% | 8,122 characters |
Finding: (1) In C4a, after the evolver generated 3 seed skills, the agent chose to read at least one skill on 32.94% of the remaining tasks. (2) Even with more (and longer) skills in C4b, we still see a consistent gain. This indicates that well-targeted skills deliver a measurable improvement, unlike tool, memory, or prompt changes.
4.5 Auto-Harness: The combination of prompt, tool, memory, and skill (C5)
Furthermore, we test combinations of harness components. Experiments C1–C4 show that skills help most and that lazy loading does not hurt performance much. On top of skills, we explore different harness combinations in depth.
| Experiment | Approach | Score | Note |
|---|---|---|---|
| C5a | Skill + Prompt | 67.11% | |
| C5b | Skill + Tool | 66.23% | |
| C5c | Skill + Memory | 67.02% | |
| C5d | Skill + Prompt + Tool | 69.13% | |
| C5e | Skill + Prompt + Memory | 67.29% | |
| C5f | Skill + Tool + Memory | 64.19% | |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |
Finding: Across the C5 series, adding {prompt, tool, memory} on top of skills hurts, except in C5d. For Terminal-Bench 2.0, we should therefore focus on skill evolution to approach the north star established in B1.
5. Skill Evolution Strategies (Group D)
Skills employ a lazy-loading strategy: we list only skill names and one-line descriptions in the system prompt, and the agent calls read_skill(name) to load full content on demand.
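Lazy loading can be sketched as a small registry. The class name and on-disk layout (one SKILL.md-style file per skill, first line as the one-line description) are assumptions; `read_skill` is the tool name from the text:

```python
from pathlib import Path

class SkillLibrary:
    """Skills live as markdown files; only the index enters the prompt."""
    def __init__(self, root: str):
        self.root = Path(root)

    def index_for_prompt(self) -> str:
        """Names + one-line descriptions only, injected into the system prompt."""
        lines = []
        for path in sorted(self.root.glob("*.md")):
            first_line = path.read_text().splitlines()[0]
            lines.append(f"- {path.stem}: {first_line}")
        return "\n".join(lines)

    def read_skill(self, name: str) -> str:
        """Full content is loaded only when the agent explicitly asks."""
        return (self.root / f"{name}.md").read_text()
```

The agent's context thus pays only a few lines per skill up front, and the full 1,500-2,000 characters only for skills it actually reads.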
5.1 Skill Size Optimization (D1)
We examine the impact of skill size:
| Experiment | Approach | Score | Note |
|---|---|---|---|
| D1a | 0~500 chars | 67.21% | |
| D1b | 500~1000 chars | 66.73% | |
| D1c | 1000~1500 chars | 68.91% | |
| D1d | 1500~2000 chars | 71.23% | |
| D1e | 2000~2500 chars | 70.02% | |
| D1f | 2500~3000 chars | 68.43% | |
| D1g | 3000~3500 chars | 67.19% |
Finding: Skill size exhibits a clear sweet spot at 1,500–2,000 characters (D1d, 71.23%). Skills shorter than 1,000 characters lack sufficient domain detail to be useful, while skills exceeding 2,500 characters dilute the signal with redundant content. The drop-off is roughly symmetric: -4.02pp below the sweet spot (D1a) and -4.04pp above it (D1g). This guides our skill creation constraints in subsequent experiments.
5.2 Skill count budget and Per-Task Keyword Selection (D2)
We investigate the number of added skills. We also test a mitigation for the unbounded setting: match skill keywords against the task description and inject only the top-k skills ranked by n-gram overlap relevance.
| Experiment | Skill Delivery | Score | |
|---|---|---|---|
| D2a | with skill count budgets of 3 | 70.58% | |
| D2b | with skill count budgets of 6 | 72.21% | |
| D2c | without skill count budgets | 68.23% | |
| D2d | without skill count budgets but with per-task keyword selection | 71.94% |
Finding: A moderate skill budget of 6 (D2b) achieves the best score at 72.21%, outperforming both the tighter budget of 3 (D2a, 70.58%, with ratio of tasks using skills as 32.94%) and the unbounded setting (D2c, 68.23%). Unbounded skill injection degrades performance because the agent loads irrelevant skills that distract from the task. Per-task keyword selection (D2d, 71.94%) partially mitigates the unbounded problem by filtering skills for relevance, recovering most of the gain without a hard count limit.
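The per-task keyword selection in D2d can be sketched as follows. Scoring by word-bigram Jaccard overlap is our reading of "N-gram overlapping"; the exact formula and function names are assumptions:

```python
# Rank skills by n-gram overlap with the task description, keep top-k (D2d).

def ngrams(text: str, n: int = 2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def select_skills(task_description: str, skills: dict, k: int = 3):
    """skills maps skill name -> keyword string; returns top-k names."""
    task_grams = ngrams(task_description)
    def score(item):
        name, keywords = item
        grams = ngrams(keywords)
        # Jaccard overlap between task bigrams and skill-keyword bigrams.
        return len(task_grams & grams) / max(len(task_grams | grams), 1)
    ranked = sorted(skills.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]
```

This filter lets the library grow unboundedly while each task only sees its few most relevant skills, matching D2d's partial recovery of the D2b gain.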
5.3 Solver-Proposed Skills (D3)
We tested having the solver (with full environment context) propose skill drafts after each task, with the evolver judging the skills and curating proposals.
| Experiment | Skill Generation | Score | Note |
|---|---|---|---|
| D3a | Evolver creates from trajectories | 72.21% | |
| D3b | Solver creates, evolver refines | 67.35% | |
| D3c | Solver creates, evolver generalizes | 69.16% | |
| D3d | Evolver creates and evolver judges and refines | 70.37% |
Finding: “Solver creates, evolver refines” (D3b) has the lowest score. The solver excels at extracting task-specific knowledge but fails at generalization. The evolver merged hyper-specific drafts into existing skills, replacing useful content with narrow tips like “ICS calendar parsing” and “OCR configuration.” “Solver creates, evolver generalizes” (D3c) partially recovered by instructing the evolver to “never shrink” existing skills and to generalize across them. However, the best result still comes from the evolver creating skills directly from trajectories (D3a). Lesson: The bottleneck is not knowledge extraction but knowledge generalization — the solver cannot reliably identify what transfers to unseen tasks.
5.4 Case study: The Process-Advice Trap (D4)
When we analyzed long-horizon trajectories and read the skill content from the D1–D3 experiments, we found that the evolver gravitates toward generic process advice. Without constraints, it produces skills like “always check if packages are installed” and “handle timeouts by trying alternatives.” We introduced a FORBIDDEN/REQUIRED blocklist as follows:
FORBIDDEN: Timeout handling, pip tips, session persistence, generic debugging
REQUIRED: Domain-specific libraries, verification steps, domain pitfalls ONLY
| Experiment | Skill guidance | Score | Note |
|---|---|---|---|
| D4a | Generic skill creation guidance in evolver agent | 72.21% | |
| D4b | + negative prompting | 73.17% | |
| D4c | + positive steering | 70.36% | |
| D4d | + negative prompting + positive steering | 73.21% |
Finding: Negative prompting (D4b, blocking bad outputs) was more effective than positive steering (D4c). The combination (D4d) achieves the highest score but only marginally above negative prompting alone (+0.04pp), suggesting most of the gain comes from the blocklist. Without the blocklist, skills accumulate generic process advice; with it, they stay focused on domain-specific knowledge.
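In our experiments the blocklist is prompt-level guidance to the evolver, but the same gate can be expressed as a programmatic check on candidate skills. A sketch, where the keyword lists paraphrase the FORBIDDEN/REQUIRED rules and substring matching is an assumption:

```python
# Quality gate on candidate skills (D4): block generic process advice,
# require domain-specific markers.

FORBIDDEN = ["timeout handling", "pip", "session persistence",
             "generic debugging"]
REQUIRED_ANY = ["library", "verify", "pitfall"]  # domain-specific markers

def passes_gate(skill_text: str) -> bool:
    text = skill_text.lower()
    if any(bad in text for bad in FORBIDDEN):
        return False                               # negative prompting: hard block
    return any(good in text for good in REQUIRED_ANY)  # positive steering
```

Consistent with D4b vs D4c, the hard block does most of the work; the REQUIRED check alone would still admit generic advice phrased around the right words.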
5.5 Case Study: Skill Evolution Non-Determinism (D5)
Similar to the non-determinism in LLM response generation and agent execution, we examine non-determinism in skill evolution: five evolution runs start from an identical seed. The results are shown below.
| Experiment | Exemplar debugging-skill content (differs per run) | Scientific-computing-skill content (differs per run) | Score |
|---|---|---|---|
| D5a | Protobuf API migration | ML GPU detection | 69.65% |
| D5b | ICS file writing | OCR/tesseract | 71.30% |
| D5c | constraint solvers | KL divergence | 72.21% |
| D5d | C/C++ memory debugging (valgrind) | calendar scheduling | 73.12% |
| D5e | async cancellation | logic circuits | 74.77% |
Finding: Five evolution runs from identical seeds produced substantially different skill content — each evolver wrote a different “textbook” based on which tasks happened to fail in each batch. Despite the content divergence, scores ranged from 69.65% to 74.77%, indicating that multiple distinct skill sets can be effective. This implies that evolved skills are samples from a large space of possibly-useful knowledge, heavily path-dependent on batch ordering and stochastic task outcomes. Ensembling diverse skill sets may be a promising future direction.
6. Self-Verification: Breaking the Ceiling (Group E)
As motivation, our error analysis (Section 3.3) revealed that many flaky-task failures stem from unchecked assumptions rather than missing knowledge. The agent solves the task correctly most of the time but occasionally submits without verifying a critical decision: a wrong normalization, a broken package manager, overfitting to the example.
6.1 Ablation Study: Various Verification Approaches (E1)
We tested five verification delivery mechanisms on the 21 flaky tasks (Opus 4.6, no evolution):
| Experiment | Approach | Score | Change | Avg Turns | Total Tokens |
|---|---|---|---|---|---|
| E1a | Baseline (no verification) | 33.33% | — | 77.8 | 24.34M |
| E1b | Two-step verification-submit workflow | 42.86% | +9.52pp | 78.1 | 25.34M |
| E1c | Optional verify() tool | 47.62% | +14.29pp | 91.0 | 46.46M |
| E1d | Self-verification skill | 66.67% | +33.33pp | 82.0 | 32.29M |
| E1e | Enhanced system prompt | 52.38% | +19.05pp | 69.5 | 18.87M |
| E1f | Skill + prompt combo | 42.86% | +9.52pp | 71.2 | 22.05M |
The self-verification skill (E1d) was the clear winner. Critically, the combination of skill + prompt (E1f) performed worse than either alone — a counterintuitive result with an important explanation.
Why the Combination Hurts
The skill works because it’s lazy-loaded: the model reads it only when uncertain. The prompt instruction forces verification on every task, including those where the model is already confident. When both are present:
- The prompt forces a verification step the model wouldn’t naturally take
- The model also loads the skill (because it’s now in verification mode)
- Redundant double-verification wastes 2-3 turns per task
- The aggregate overhead costs more tasks than extra checking saves
Lesson: Let the model self-select when to verify rather than forcing it universally.
6.2 Scaling to Full Benchmark: The Initial Disappointment (E2)
The ablation showed the gain from the self-verification skill on the 21 flaky tasks. We then ran the same configuration on all 89 tasks with various self-verification skills.
| Experiment | Skill setup | Score | Note |
|---|---|---|---|
| E2a | No predefined seed skills | 72.21% | |
| E2b | Generic self-verification skill as seed skill | 70.16% | |
| E2c | Curated self-verification skill as seed skill | 74.27% |
Finding: Without any predefined skills (E2a), the baseline already achieves 72.21%. Adding a generic self-verification skill (E2b) actually hurts by 2.05pp — the skill contained advice like “check edge cases,” “test empty input,” “verify permissions,” which is knowledge the model already has (the same process-advice trap from Section 5.4). Only when we curated the skill with targeted assumption-checking content (E2c) did verification deliver a +2.06pp gain. This confirms that what the skill says matters far more than that a skill exists.
Guided by failure pattern analysis (Section 3.3), we replaced the generic advice with targeted assumption-checking:
Before (generic):
## 3. Edge Cases
- Check with empty input, special characters, large input
- Verify error handling works
- Check permissions (chmod, ownership)
After (targeted):
## 3. Check Your Assumptions
- If you chose between multiple approaches (e.g., normalization methods),
verify your choice matches what the task/test expects
- If you installed or modified system packages, verify the package manager
still works
- If your solution works on the provided example, consider whether it
generalizes to different inputs
Each bullet addresses a real failure mode:
- Normalization choice: pytorch-model-cli failure (chose MNIST mean/std instead of raw /255)
- Package manager state: dna-assembly failure (apt-get timeout corrupted dpkg)
- Generalization: video-processing failure (MOG2 overfit to example video)
We also enhanced the server verification: “curl or wget to test endpoints — verify response content, not just that the port is open. Checking ports with ss or netstat is NOT enough.” This addresses install-windows-3.11 where the agent verified port 80 was open but never tested nginx content.
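The "verify content, not just the port" distinction can be demonstrated concretely. In this self-contained sketch (helper names are ours), a stub server stands in for a misconfigured nginx: the shallow port check passes while the content check correctly fails:

```python
import http.server, socket, threading, urllib.request

class StubHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"default backend page")  # wrong content on purpose
    def log_message(self, *args):                  # silence request logging
        pass

def port_is_open(host: str, port: int) -> bool:
    """Shallow check, equivalent to looking at ss/netstat output."""
    with socket.socket() as s:
        s.settimeout(2)
        return s.connect_ex((host, port)) == 0

def serves_expected(url: str, expected: str) -> bool:
    """Deep check: fetch the page and inspect the actual response body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return expected in resp.read().decode()

server = http.server.HTTPServer(("127.0.0.1", 0), StubHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

shallow_ok = port_is_open("127.0.0.1", port)
deep_ok = serves_expected(f"http://{'127.0.0.1'}:{port}/", "Welcome to nginx")
server.shutdown()
```

Here `shallow_ok` is true while `deep_ok` is false, which is exactly the failure mode the enhanced verification guidance targets.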
6.3 The Nudge: Directing the Model to Read the Skill (E3)
With improved content, we added a single line to the system prompt: “After you think you have completed the task, read the self-verification skill to verify your solution.”
| Experiment | Nudge | Skill Content | Skill Loads | Score |
|---|---|---|---|---|
| E3a | None | targeted | 24.71% | 74.27% |
| E3b | Verification nudge | generic | 77.65% | 68.24% |
| E3c | Verification nudge | targeted | 90.59% | 76.47% |
Finding: The system prompt nudge dramatically increases skill loading frequency — from 24.71% (E3a) to 77.65%–90.59% (E3b/E3c). However, loading frequency alone does not guarantee improvement: E3b loads the skill on 77.65% of tasks yet drops performance by 6.03pp because the generic content wastes turns on checks the model already performs. E3c, with targeted content and 90.59% loading, achieves the best result at 76.47%. Lesson: Content quality matters more than loading frequency. The improvement comes from making each verification round productive (catches real bugs) rather than wasteful (checks things the model already knows).
7. Multi-Agent System Evolver (Group F)
Our single-agent evolver (used throughout Groups C–E) handles analysis, skill authoring, and quality control within one context window. This creates path dependence (D5) and forces a single agent to context-switch between fundamentally different cognitive tasks. We decomposed skill evolution into four specialized LLM agents — Orchestrator, Analyst, Author, and Critic — called MAS-Evolver.
MAS-Evolver uses an orchestrator-dispatch pattern on the Strands framework:
| Agent | Role | Tools | Output |
|---|---|---|---|
| Orchestrator | Coordinate evolution cycle | analyze, author_skill, critique_skill, workspace_bash | Skills written to workspace |
| Analyst | Identify failure patterns from compressed trajectories | None (pure reasoning) | JSON failure patterns |
| Author | Create candidate skills (1,500–2,000 chars) | None (pure reasoning) | SKILL.md candidate |
| Critic | Adversarial review | None (pure reasoning) | Verdict: APPROVE / REVISE / REJECT |
Key design decisions: (1) subagents are pure-reasoning with no tools — only the orchestrator mutates the workspace; (2) strict information boundaries prevent bias: the Analyst never sees existing skills, the Author never sees raw trajectories, and the Critic never sees the Author’s reasoning.
7.1 MAS Architecture Comparison (F1)
We compared three multi-agent coordination patterns against vanilla. The orchestrator pattern uses a central coordinator that dynamically sequences subagent calls with an iterative author–critic revision loop (up to 2 rounds per skill). The DAG pattern arranges agents in a fixed directed acyclic graph (Analyst → Author → Critic) with no revision cycles. The swarm pattern allows agents to self-organize without centralized control.
| Experiment | System | Accuracy | Note |
|---|---|---|---|
| F1a | Vanilla Opus 4.6 (= A2) | 65.91% | — |
| F1b | Orchestrator MAS design | 69.28% | Dynamic sequencing + author–critic loop |
| F1c | DAG MAS design | 67.31% | Fixed pipeline, no revision cycles |
| F1d | Agent Swarm MAS design | 68.45% | Decentralized, no centralized control |
Finding: The orchestrator pattern (F1b, 69.28%) outperforms both DAG (F1c, -1.97pp) and swarm (F1d, -0.83pp). The DAG's fixed pipeline cannot revise rejected skills: once the Critic rejects, the skill is lost. The swarm lacks centralized quality control, producing more skills but with inconsistent quality. The orchestrator's iterative author–critic loop is the key differentiator: in Cycle 1, MAS-Evolver produced new skills (e.g., numerical-algorithm-implementation, covering log-space arithmetic in R, NaN/Inf guards, and adaptive rejection sampling), and the Critic approved after one revision round. Cycles 2–4 repeated the process, creating additional skills until the budget was met.
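The orchestrator's author–critic revision loop can be sketched as below. The subagents are stubbed as plain functions (in the real system they are Strands LLM agents), and the control flow is our reading of the F1b design:

```python
# Orchestrator-side skill evolution (F1b): Analyst -> Author -> Critic,
# with up to 2 revision rounds per skill. Information boundaries: the
# Analyst never sees existing skills, the Author never sees raw
# trajectories, the Critic never sees the Author's reasoning.

def evolve_skill(analyst, author, critic, trajectories, max_revisions=2):
    pattern = analyst(trajectories)           # JSON-like failure pattern
    draft = author(pattern)                   # SKILL.md candidate
    for _ in range(max_revisions):
        verdict, feedback = critic(draft)     # APPROVE / REVISE / REJECT
        if verdict == "APPROVE":
            return draft                      # skill written to workspace
        if verdict == "REJECT":
            return None                       # quality gate: drop the skill
        draft = author((pattern, feedback))   # REVISE: one more round
    return None                               # revision budget exhausted
```

The DAG variant corresponds to `max_revisions=0` plus treating REVISE as REJECT, which is why rejected skills are simply lost there.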
7.2 Context Management in MAS (F2)
With the orchestrator architecture fixed from F1, we investigated how trajectory compression affects skill quality, since context management and agent communication are key in a MAS. Each evolution cycle processes batch trajectories that can be compressed for the subagents. We compared three strategies: adaptive compression, which adjusts granularity based on content complexity and the remaining budget (e.g., the model's context window limit); whole-trajectory compression, which keeps the task description intact and uniformly compresses the entire trajectory; and sliding-window compression, which retains the task description plus a fixed recent window and compresses everything earlier.
| Experiment | System | Accuracy | Note |
|---|---|---|---|
| F2a | Raw trajectory with adaptive compression based on content and budget | 72.94% | ~2K chars/task |
| F2b | Kept task description + whole trajectory compression | 70.31% | |
| F2c | Kept task description + sliding window compression | 68.91% |
Finding: Adaptive compression (F2a, 72.94%) outperforms both whole-trajectory (F2b, -2.63pp) and sliding-window (F2c, -4.03pp) compression. Whole-trajectory compression discards too much detail uniformly — critical failure signals (e.g., a wrong normalization choice at turn 12) get compressed away alongside irrelevant boilerplate. Sliding-window compression is worse because it systematically drops early-trajectory context where task understanding and initial approach decisions occur, which are precisely the signals the Analyst needs. Adaptive compression preserves high-information-density segments (errors, decision points, tool output anomalies) while compressing routine segments (package installs, file reads) when the context window limit is reached. Overall, MAS-Evolver with adaptive compression achieves +7.03pp over vanilla, comparable to the single-agent skill evolution baseline (D2b: 72.21%).
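The adaptive strategy can be sketched as a density-aware filter. The high-signal markers and the truncation scheme below are illustrative assumptions; the point is that error/decision segments survive verbatim while routine segments are stubbed:

```python
# Adaptive trajectory compression (F2a sketch): keep high-information
# segments verbatim, stub routine ones, cap total size as a last resort.

HIGH_SIGNAL = ("error", "traceback", "failed", "assert", "warning")

def compress(trajectory, budget_chars=2000, stub_len=80):
    """trajectory: list of text segments (tool calls + outputs)."""
    kept = []
    for seg in trajectory:
        if any(marker in seg.lower() for marker in HIGH_SIGNAL):
            kept.append(seg)                 # preserve errors/decision points
        else:
            kept.append(seg[:stub_len] + ("..." if len(seg) > stub_len else ""))
    out = "\n".join(kept)
    return out[:budget_chars]                # hard cap on the ~2K chars/task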
8. Results Summary
8.1 Full Score Progression
Below we present all experiments organized by group. Detailed analysis for each experiment is in its respective section.
Group A: Baselines (Section 2)
| Experiment | Description | Score |
|---|---|---|
| A1 | Opus 4.5 vanilla ReAct agent | 60.15% |
| A2 | Opus 4.6 vanilla ReAct agent | 65.91% |
Group B: Ceiling Analysis (Section 3)
| Experiment | Description | Score |
|---|---|---|
| B1a | Pass@5 across 5 independent A2 runs | 85.88% |
| B1b | 2× per-task timeout | 68.12% |
| B1c | Environment signal (artifact vs ground-truth) | 66.23% |
Group C: Harness Component Exploration (Section 4)
| Experiment | Description | Score |
|---|---|---|
| C1a | Prompt-only evolution (lightweight) | 66.02% |
| C1b | Prompt-only evolution (tailored from B3) | 60.13% |
| C2a | Memory-only evolution (with budgets) | 65.11% |
| C2b | Memory-only evolution (unbounded) | 60.46% |
| C3a | Bash + Python (no Submit tool) | 58.12% |
| C3b | Bash + Python + Submit (= A2) | 65.91% |
| C3c | + Self-generated tools | 59.37% |
| C4a | Skill-only evolution (budgeted) | 70.58% |
| C4b | Skill-only evolution (unbounded) | 68.23% |
| C5a | Skill + Prompt | 67.11% |
| C5b | Skill + Tool | 66.23% |
| C5c | Skill + Memory | 67.02% |
| C5d | Skill + Prompt + Tool | 69.13% |
| C5e | Skill + Prompt + Memory | 67.29% |
| C5f | Skill + Tool + Memory | 64.19% |
| C5g | Skill + Prompt + Tool + Memory | 65.19% |
Group D: Skill Evolution Strategies (Section 5)
| Experiment | Description | Score |
|---|---|---|
| D1a | Skill size 0–500 chars | 67.21% |
| D1b | Skill size 500–1,000 chars | 66.73% |
| D1c | Skill size 1,000–1,500 chars | 68.91% |
| D1d | Skill size 1,500–2,000 chars | 71.23% |
| D1e | Skill size 2,000–2,500 chars | 70.02% |
| D1f | Skill size 2,500–3,000 chars | 68.43% |
| D1g | Skill size 3,000–3,500 chars | 67.19% |
| D2a | Skill count budget = 3 | 70.58% |
| D2b | Skill count budget = 6 | 72.21% |
| D2c | No skill count budget | 68.23% |
| D2d | No budget + per-task keyword selection | 71.94% |
| D3a | Evolver creates from trajectories | 72.21% |
| D3b | Solver creates, evolver refines | 67.35% |
| D3c | Solver creates, evolver generalizes | 69.16% |
| D3d | Evolver creates and judges/refines | 70.37% |
| D4a | Generic skill creation guidance | 72.21% |
| D4b | + FORBIDDEN blocklist (negative prompting) | 73.17% |
| D4c | + positive steering | 70.36% |
| D4d | + negative prompting + positive steering | 73.21% |
| D5a–D5e | 5 runs from identical seed (skill variance) | 69.65%–74.77% |
Group E: Self-Verification (Section 6)
| Experiment | Description | Score | Note |
|---|---|---|---|
| E1a | Baseline — no verification (21 flaky tasks) | 33.33% | Flaky subset only |
| E1b | Two-step verification-submit workflow | 42.86% | Flaky subset only |
| E1c | Optional verify() tool | 47.62% | Flaky subset only |
| E1d | Self-verification skill | 66.67% | Flaky subset only |
| E1e | Enhanced system prompt | 52.38% | Flaky subset only |
| E1f | Skill + prompt combo | 42.86% | Flaky subset only |
| E2a | No predefined seed skills (full benchmark) | 72.21% | |
| E2b | Generic self-verification skill | 70.16% | |
| E2c | Curated self-verification skill | 74.27% | |
| E3a | No nudge, targeted skill | 74.27% | 24.71% skill load rate |
| E3b | Nudge + generic skill | 68.24% | 77.65% skill load rate |
| E3c | Nudge + targeted skill | 76.47% | 90.59% skill load rate |
Group F: Multi-Agent System Evolver (Section 7)
| Experiment | Description | Score | Note |
|---|---|---|---|
| F1a | Vanilla Opus 4.6 (= A2) | 65.91% | — |
| F1b | Orchestrator MAS design | 69.28% | Dynamic sequencing + author–critic loop |
| F1c | DAG MAS design | 67.31% | Fixed pipeline, no revision cycles |
| F1d | Agent Swarm MAS design | 68.45% | Decentralized, no centralized control |
| F2a | Adaptive compression | 72.94% | ~2K chars/task |
| F2b | Whole trajectory compression | 70.31% | |
| F2c | Sliding window compression | 68.91% | |
Key Progression Summary
The path from baseline to best result:
| Stage | Key Innovation | Score | Δ from A2 |
|---|---|---|---|
| A1 | Opus 4.5 baseline | 60.15% | — |
| A2 | Opus 4.6 baseline | 65.91% | — |
| C4a | Skill-only evolution (budgeted) | 70.58% | +4.67pp |
| D2b | Skill count budget = 6 | 72.21% | +6.30pp |
| F2a | MAS-Evolver (multi-agent skill evolution) | 72.94% | +7.03pp |
| D4d | FORBIDDEN blocklist | 73.21% | +7.30pp |
| E2c | Curated verification skill | 74.27% | +8.36pp |
| E3c | Targeted verification + nudge | 76.47% | +10.56pp |
8.2 Leaderboard
| System | Score | Notes |
|---|---|---|
| Opus 4.6 (A2) | 65.91% | |
| MAS-Evolver (F2a) | 72.94% | |
| Copy (Top #7) | 75.3% | |
| A-Evolve (E3c) | 76.47% | |
| ForgeCode (Top #1) | 81.8% | |
9. Key Learnings
What Works
Lazy skill loading. Let the model decide which skills to read. Without nudging, the model self-selects on ~25–33% of tasks (E3a, C4a); with a targeted nudge, up to 91% (E3c). System prompt stays at ~1,900 characters instead of 15,000+.
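A minimal sketch of this lazy-loading pattern, assuming a simple in-memory skill store; the function names, tool interface, and skill contents are illustrative, not the paper's actual harness:

```python
def skill_index(skills: dict[str, str]) -> str:
    """Build the short index that goes into the system prompt:
    one name plus first line per skill, never the full bodies,
    which keeps the prompt small instead of embedding everything."""
    lines = [f"- {name}: {body.splitlines()[0]}"
             for name, body in sorted(skills.items())]
    return ("Available skills (call load_skill(<name>) to read one in full):\n"
            + "\n".join(lines))

def load_skill(skills: dict[str, str], name: str) -> str:
    """Tool exposed to the solver; returns the full skill body on demand."""
    return skills.get(name, f"No skill named {name!r}.")

# Illustrative skill store (names and bodies are made up).
SKILLS = {
    "self-verification": "Check your assumptions before submitting.\nFull body here...",
    "git-history": "Sanitize git history safely.\nFull body here...",
}
```

The model sees only the one-line index and decides per task whether a full skill is worth a tool call.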
Targeted self-verification. A “Check Your Assumptions” framing in the self-verification skill directs the solver to verify its decision choices (normalization method, algorithm selection, package state) rather than generic edge cases. Content quality drives the improvement: generic verification hurts (E2b, -2.05pp) while targeted verification helps (E3c, +10.56pp).
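A sketch of how a per-task nudge might steer the solver toward loading the verification skill before submitting; the nudge text and injection point are assumptions, not the exact wording used in E3c:

```python
# Hypothetical nudge appended to the task message; in E3c a targeted
# nudge like this raised the skill load rate from ~25% to ~91%.
NUDGE = (
    "Before calling Submit, load the 'self-verification' skill and "
    "re-check the decision choices you made (normalization method, "
    "algorithm selection, package state) against the task description."
)

def build_user_message(task_description: str, nudge: bool = True) -> str:
    """Compose the solver's task message, optionally with the nudge."""
    return f"{task_description}\n\n{NUDGE}" if nudge else task_description
```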
FORBIDDEN blocklist. Blocking generic process advice forces domain-specific skill creation (D4b, +0.96pp over unblocked D4a). Negative prompting outperforms positive steering (D4b vs D4c).
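In the paper the blocklist is negative prompting in the evolver's skill-creation guidance; as a complementary hard gate, candidate skills could also be checked mechanically. A sketch, where the phrase list and function name are illustrative assumptions:

```python
# Hypothetical blocklist of generic process advice that a new skill
# must not lean on; real entries would come from observed failure modes.
FORBIDDEN = [
    "read the task carefully",
    "break the problem into steps",
    "test your changes",
    "think step by step",
]

def passes_blocklist(skill_text: str) -> bool:
    """Reject candidate skills that contain generic advice instead of
    domain-specific knowledge (the blocklist enforced in code)."""
    lowered = skill_text.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN)
```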
Minimal prompt. The 1,368-character base prompt is near-optimal (C1b shows longer prompts hurt). Total budget including nudges: ~1,485 characters. Every addition must be weighed against attention dilution.
What Doesn’t Work
Solver-proposed skills. The solver extracts task-specific knowledge but fails at generalization (D3b: -4.86pp vs D3a).
Meta-skills. A “systematic-exploration” skill was loaded on only 1.18% of tasks. The model does not self-assess when it is stuck.
Unbounded skill injection. Removing skill count budgets degrades performance (D2c: 68.23% vs D2b: 72.21%) even with lazy loading, because the agent loads irrelevant skills.
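One way to enforce such a budget is to keep only the top-k skills by a usefulness score. This sketch assumes load-and-pass counts as the signal, which is our guess at a reasonable heuristic rather than the paper's actual pruning rule:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    body: str
    loads: int = 0   # how often the solver chose to load this skill
    wins: int = 0    # loads that ended in a passing task

def enforce_budget(skills: list[Skill], budget: int = 6) -> list[Skill]:
    """Keep at most `budget` skills (D2b found 6 works well),
    preferring skills that were loaded often and associated with success."""
    def score(s: Skill) -> float:
        return (s.wins + 1) / (s.loads + 2)  # smoothed win rate
    return sorted(skills, key=score, reverse=True)[:budget]
```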
Design Insights
Error analysis drives improvement. The +9.41pp gain came from trajectory analysis → failure pattern identification → targeted skill content.
Simple scaffold wins. Best configuration: minimal prompt (1.5K) + 5 focused skills (8.5K) + lazy loading + no evolution.
Multi-agent decomposition validates but doesn’t yet surpass. MAS-Evolver (72.94%) matches the single-agent baseline (D2b: 72.21%) while producing adversarially reviewed skills. The quality gate works — skills pass Critic review — but the architecture has not yet incorporated the refinements (blocklist, targeted verification) that drive the single-agent system to 76.47%.
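The author-critic quality gate described above can be sketched as a bounded revision loop; the agent call signatures are placeholders, not the actual MAS-Evolver interfaces:

```python
from typing import Callable, Optional, Tuple

def author_critic_loop(
    author: Callable[[str, str], str],          # (task_summary, feedback) -> skill draft
    critic: Callable[[str], Tuple[bool, str]],  # draft -> (approved, feedback)
    task_summary: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Iterate author drafts through adversarial critic review;
    return the first approved skill, or None if the gate never passes."""
    feedback = ""
    for _ in range(max_rounds):
        draft = author(task_summary, feedback)
        approved, feedback = critic(draft)
        if approved:
            return draft
    return None
```

Only skills that survive the critic enter the skill pool, which is the quality gate that keeps generic or task-overfit drafts out.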
10. Future Directions
A-Evolve+: Richer signal beyond in-situ evolution. The trajectory-only evolver sees compressed logs without pass/fail outcomes. Richer feedback (e.g., the pass/fail result or other verifiable signals) could improve skill quality.
