How Good Are GLM5.2, Claude Fable 5, and Others: Create Your Own Benchmark and Revisit Your Training Recipe
Update #2: Upon GLM5.2’s release, we self-hosted the BF16 version on a 4-node H200 cluster for benchmark evaluation. Configuration details are in the Appendix.
Update #1: Upon Claude Fable 5’s release, we evaluated it immediately at the default high reasoning effort via the Claude API. After we completed all 250×5 pass@5 episodes, model access became unavailable, so we were unable to test higher reasoning efforts (xhigh, max) for this model.
1. Summary
We present a systematic evaluation of 12 model configurations on a self-created vertical-domain agentic multi-turn coding benchmark. The benchmark is fully self-owned with verifiable ground-truth answers, testing LLMs’ ability to solve complex problems through iterative tool use (Python, Bash, SQL, local-host KB search, task submission) within self-contained Docker environments. A ReAct-based agent loop serves as the evaluation harness (consistent with our TRAJECT-Bench framework), producing pass@1 and pass@5 accuracy scores for rigorous cross-model comparison.
Our evaluation yields four main findings:
Benchmark validation. We evaluate four models spanning 35B to 744B total parameters (Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, GLM5.2-744B-A40B). Accuracy scales monotonically from 0.173 to 0.456 with increasing active parameters, confirming the benchmark’s discriminative power without floor or ceiling effects.
Closed-source frontier evaluation. Testing seven frontier models from Anthropic (Claude Fable 5, Opus 4.6/4.7/4.8), OpenAI (GPT-5.5, GPT-5.4), and Google (Gemini 3.5 Flash) at multiple reasoning effort levels, we find that Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676). Reasoning effort scaling is model-specific: some models gain up to +5.3 points at higher effort, while others degrade.
GLM5.2-744B-A40B positioning. With 0.456 pass@1 and 0.592 pass@5, GLM5.2 ranks 5th on pass@1 and 3rd on pass@5. On pass@5 it surpasses Gemini 3.5 Flash, GPT-5.5, and Claude Opus 4.8; on pass@1, GPT-5.5 at xhigh effort (0.459) stays marginally ahead. In head-to-head comparison with Claude Fable 5, it reaches the same outcome on 77% of tasks, with gaps of 9.7 points (pass@1) and 8.4 points (pass@5).
Training recipe insights. Cross-referencing closed-model behavior with our own RL experiments reveals: (a) PPO-trained models exhibit more exploratory multi-turn behavior than GRPO/DAPO-trained models, consistent with GLM5.2’s pattern; (b) small models (35B) suffer severe catastrophic forgetting under aggressive RL despite strong task accuracy; (c) larger models (122B+) tolerate RL with only ~2% general capability degradation.
2. Introduction
2.1 Motivation: Why Create Your Own Benchmark?
Public benchmarks face two well-documented challenges: data contamination (models may have seen test data during pre-training) and benchmark-maxing (developers optimize specifically for known benchmarks rather than general capability). Our prior work, TRAJECT-Bench (ICLR 2025), introduced trajectory-aware evaluation of LLM tool use with fine-grained diagnostics — tool selection correctness, argument accuracy, and dependency satisfaction. As a published benchmark, however, it is susceptible to leakage over time.
To obtain reliable, uncontaminated evaluation signals, we built our own vertical-domain agentic multi-turn coding benchmark with the following properties:
- Fully private and self-owned — never published, shared externally, or included in any training corpus, eliminating contamination risk entirely.
- Verifiable results — each task has a definitive ground-truth answer, enabling objective pass@1 and pass@5 accuracy measurement without reliance on model-based judges.
- Multi-turn agentic coding — models must autonomously decide which tools to invoke (Python, Bash, SQL, local-host KB search), write and execute code within Docker environments, interpret intermediate results, and iteratively build toward a solution through multiple rounds of tool use.
- Multi-domain grounding — tasks span three vertical domains (code migration, data analytics, financial QA) with realistic complexity that requires genuine multi-step reasoning.
- Comprehensive metrics — four complementary evaluation signals: accuracy, format compliance, CoT reasoning quality, and investigation efficiency.
2.2 Benchmark Overview
Task domains. The benchmark spans three vertical domains, each contributing 100 candidate tasks (300 total):
- Code migration — translating code between programming languages and verifying output equivalence
- Data analytics — multi-step numerical reasoning over structured databases requiring query composition and interpretation
- Financial QA — quantitative reasoning over public companies’ latest earnings reports, involving multi-step calculations (ratios, growth rates, aggregations)
After manual quality review to ensure unambiguous ground-truth answers and appropriate difficulty calibration, 50 tasks were removed, yielding 250 tasks for downstream evaluation.
Task environment. Each task is instantiated in a self-contained Docker environment containing the necessary data, tables, and initial code artifacts. The model interacts with this environment through a set of tools:
- Python execution — write and run analytical code
- Bash — execute shell commands within the Docker container
- SQL query — retrieve structured data from knowledge bases
- Search — query a local-host knowledge-base search engine for relevant information
- Submit — signal task completion with the final answer
Agent harness. The evaluation employs a ReAct-based agent loop (Reasoning + Acting) — the same framework used in our TRAJECT-Bench (ICLR 2025) paper. At each step, the model observes the current state (previous tool outputs), reasons about what information is still needed, and selects the next tool action. The loop continues until the model submits a final answer or reaches the maximum step limit.
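The loop described above can be sketched as follows. This is a minimal illustration, not the harness used for the benchmark: `Action`, `run_episode`, the `policy` callable, and the tool registry are hypothetical names, and a scripted stub stands in for the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    tool: str       # e.g. "python", "bash", "sql", "search", "submit"
    argument: str   # code, query, or the final answer for "submit"


# Hypothetical signatures: a policy maps the observation history to the next
# action; a tool maps its argument string to an observation string.
Policy = Callable[[List[str]], Action]
Tool = Callable[[str], str]


def run_episode(policy: Policy, tools: Dict[str, Tool],
                task: str, max_steps: int = 50) -> Tuple[Optional[str], int]:
    """Run one episode; return (submitted answer or None, turns used).

    Counting the submit call as a turn is an assumption of this sketch.
    """
    history = [task]
    for turn in range(1, max_steps + 1):
        action = policy(history)              # reason, then pick a tool
        if action.tool == "submit":
            return action.argument, turn      # final answer ends the episode
        tool = tools.get(action.tool)
        observation = (tool(action.argument) if tool
                       else f"unknown tool: {action.tool}")
        history.append(observation)           # tool output feeds the next step
    return None, max_steps                    # step limit reached, no answer


# Scripted stand-in for a model: one SQL call, then submit what it saw.
def scripted_policy(history: List[str]) -> Action:
    if len(history) == 1:
        return Action("sql", "SELECT 42")
    return Action("submit", history[-1])


answer, turns = run_episode(scripted_policy, {"sql": lambda q: "42"},
                            "What is the value?")
```

An episode that never submits returns `None` together with the step limit, which is the step-exhaustion case scored as incorrect.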
Evaluation metrics:
- Accuracy (pass@1, pass@5): Did the model arrive at the correct answer? Measured as exact match against verifiable ground truth. pass@1 measures average accuracy across 5 independent samples; pass@5 measures the model’s ceiling (best-of-5).
- Format quality: Does the response follow structured output requirements across multi-turn episodes?
- CoT reasoning quality: Are the model’s cited evidence points and reasoning chains relevant to the ground-truth answer?
- Average turns: Number of tool-call rounds used (measures investigation depth and efficiency).
Scale: 250 tasks × 5 independent samples per task (1,250 episodes total).
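The two accuracy metrics, as defined above, can be computed from per-sample correctness flags. The sketch below assumes a hypothetical `results` mapping from task id to the exact-match outcome of each of the 5 samples; the data is a toy example, not benchmark output.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean per-sample accuracy, averaged over tasks."""
    return sum(sum(s) / len(s) for s in results.values()) / len(results)


def best_of_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks with at least one correct sample (pass@5 when k=5)."""
    return sum(any(s) for s in results.values()) / len(results)


# Toy data (not benchmark results): task id -> correctness of 5 samples.
toy = {
    "task_a": [True, True, False, True, True],
    "task_b": [False, False, False, False, False],
    "task_c": [False, True, False, False, False],
    "task_d": [True, True, True, True, True],
}
p1 = pass_at_1(toy)   # (0.8 + 0.0 + 0.2 + 1.0) / 4
p5 = best_of_k(toy)   # 3 of 4 tasks solved at least once
```

A task solved in only one of five samples contributes 0.2 to pass@1 but a full 1.0 to pass@5, which is why the gap between the two metrics reflects sample-to-sample consistency.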
2.3 Connection to TRAJECT-Bench (ICLR 2025)
TRAJECT-Bench (ICLR 2025) evaluated trajectory-level tool-use quality — whether tools are selected, parameterized, and ordered correctly — in controlled settings. This benchmark extends that philosophy to end-to-end problem solving: models must orchestrate sustained multi-turn tool use to solve complex problems and produce verifiable final answers. The key advance is verifiability — every task has a deterministic correct answer derivable from the data, enabling objective accuracy measurement without LLM-as-judge evaluation.
3. Benchmark Validation: Scaling Behavior
A well-designed benchmark should produce scores that scale predictably with model capability. We validate this by evaluating four models with increasing active parameters: Qwen3.5-35B-A3B (3B active), Qwen3.5-122B-A10B (10B), Qwen3.5-397B-A17B (17B), and GLM5.2-744B-A40B (40B). Qwen3.5 models are evaluated as base checkpoints without fine-tuning; GLM5.2 is evaluated with its native thinking mode enabled.
3.1 Results
pass@1:
| Model | Total Params | Active Params | Accuracy | Format | CoT Quality | Turns |
|---|---|---|---|---|---|---|
| GLM5.2-744B-A40B | 744B | 40B | 0.456 | 0.927 | 0.510 | 23.1 |
| Qwen3.5-397B-A17B | 397B | 17B | 0.356 | 0.686 | 0.450 | 14.2 |
| Qwen3.5-122B-A10B | 122B | 10B | 0.239 | 0.479 | 0.313 | 17.2 |
| Qwen3.5-35B-A3B | 35B | 3B | 0.173 | 0.425 | 0.296 | 11.5 |
pass@5:
| Model | Total Params | Active Params | Accuracy | Format | CoT Quality | Turns |
|---|---|---|---|---|---|---|
| GLM5.2-744B-A40B | 744B | 40B | 0.592 | 0.967 | 0.532 | 22.1 |
| Qwen3.5-397B-A17B | 397B | 17B | 0.564 | 0.831 | 0.545 | 14.1 |
| Qwen3.5-122B-A10B | 122B | 10B | 0.452 | 0.751 | 0.493 | 17.0 |
| Qwen3.5-35B-A3B | 35B | 3B | 0.376 | 0.703 | 0.492 | 10.8 |
3.2 Analysis
Monotonic scaling confirms validity. Accuracy improves consistently with active parameters: 0.173 (3B) → 0.239 (10B) → 0.356 (17B) → 0.456 (40B), with gains of +6.6, +11.7, and +10.0 points at successive steps. There is no floor effect (the smallest model scores meaningfully above zero) and no ceiling effect (the largest model is well below 1.0), confirming discrimination across a wide capability range.
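As a worked check, the sketch below recomputes the gain at each step from the Section 3.1 pass@1 table. Dividing by log2 of the active-parameter ratio is one possible way to express a jump "per doubling"; that normalization is an assumption of this sketch, and the variable names are illustrative.

```python
import math

# (active parameters in billions, pass@1) from the Section 3.1 table.
points = [(3, 0.173), (10, 0.239), (17, 0.356), (40, 0.456)]

steps = []
for (p0, a0), (p1, a1) in zip(points, points[1:]):
    gain_pts = (a1 - a0) * 100          # raw gain in accuracy points
    doublings = math.log2(p1 / p0)      # how many 2x steps this jump spans
    steps.append((p0, p1, gain_pts, gain_pts / doublings))

for p0, p1, gain, per_doubling in steps:
    print(f"{p0}B -> {p1}B: +{gain:.1f} pts, {per_doubling:.1f} pts per doubling")
```

Because the parameter ratios between adjacent models differ (about 3.3×, 1.7×, and 2.4×), the per-doubling figures vary more than the raw gains do.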
Cross-family generalization. The scaling holds across two different model families (Qwen3.5 and GLM5.2), indicating the benchmark measures general agentic coding capability rather than family-specific artifacts.
Format quality correlates with scale. Structured output compliance improves from 0.425 to 0.927 — larger models better maintain instruction-following across long multi-turn episodes, a critical requirement for agentic deployment.
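Investigation depth tends to increase with capability. GLM5.2-744B-A40B uses the most turns (23.1) of any evaluated model, and the smallest Qwen3.5 model uses the fewest among the open models (11.5). The trend is not strict, since Qwen3.5-122B-A10B (17.2 turns) explores more than Qwen3.5-397B-A17B (14.2), but it suggests that more capable open models explore more deeply rather than guessing early. The benchmark rewards this thoroughness.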
4. Closed-Source Frontier Model Evaluation
With the benchmark validated, we turn to frontier closed-source models. We evaluate seven models across three providers (Anthropic, OpenAI, Google), systematically testing multiple reasoning effort configurations where the API supports it.
4.1 Results
We present the top 10 closed-source model configurations ranked by accuracy. Multiple configurations of the same model are included where we tested different reasoning effort levels.
pass@1:
| # | Model | Effort | Accuracy | Format | CoT Quality | Turns |
|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 | high | 0.553 | 0.995 | 0.639 | 5.3 |
| 2 | Claude Opus 4.7 | max | 0.505 | 0.997 | 0.638 | 7.4 |
| 3 | Claude Opus 4.7 | high (default) | 0.466 | 0.999 | 0.634 | 5.6 |
| 4 | GPT-5.5 | xhigh | 0.459 | — | — | 5.1 |
| 5 | Claude Opus 4.6 | max | 0.457 | — | — | 19.4 |
| 6 | Claude Opus 4.8 | xhigh | 0.455 | — | — | 5.8 |
| 7 | GPT-5.5 | high | 0.448 | — | — | 4.7 |
| 8 | Claude Opus 4.6 | high | 0.444 | — | — | 19.6 |
| 9 | Claude Opus 4.8 | high (default) | 0.440 | 0.999 | 0.692 | 4.6 |
| 10 | Gemini 3.5 Flash | medium | 0.439 | 0.884 | 0.646 | 16.1 |
pass@5:
| # | Model | Effort | Accuracy | Format | CoT Quality | Turns |
|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 | high | 0.676 | 0.999 | 0.643 | 5.4 |
| 2 | Claude Opus 4.7 | max | 0.620 | 0.999 | 0.638 | 7.1 |
| 3 | Claude Opus 4.6 | max | 0.576 | — | — | 19.2 |
| 4 | GPT-5.5 | xhigh | 0.570 | — | — | 4.0 |
| 5 | GPT-5.5 | high | 0.564 | — | — | 3.8 |
| 6 | Gemini 3.5 Flash | medium | 0.562 | 0.963 | 0.711 | 15.0 |
| 7 | Claude Opus 4.7 | high (default) | 0.556 | 0.999 | 0.637 | 5.8 |
| 8 | Claude Opus 4.8 | high (default) | 0.556 | 0.999 | 0.689 | 4.8 |
| 9 | Claude Opus 4.6 | high | 0.544 | — | — | 19.5 |
| 10 | Claude Opus 4.8 | xhigh | 0.540 | — | — | 5.9 |
4.2 Analysis
Finding 1: Reasoning effort scaling is model-specific.
| Model | Default → Best | Δ Accuracy | Optimal Effort | Over-effort Hurts? |
|---|---|---|---|---|
| GPT-5.5 | 0.406 → 0.459 | +5.3 | xhigh | No |
| Claude Opus 4.7 | 0.466 → 0.505 | +3.9 | max | No |
| Claude Opus 4.6 | 0.438 → 0.457 | +1.9 | max | No |
| Claude Opus 4.8 | 0.440 → 0.455 | +1.5 | xhigh | Yes (max: −2.1) |
| GPT-5.4 | 0.326 → 0.331 | +0.5 | high | Yes (xhigh: timeout) |
Models that appear optimized for deep reasoning (Opus 4.7, GPT-5.5) benefit most from elevated effort. Models that are not show diminishing or negative returns: “more thinking” is not universally better and can hurt performance through over-analysis.
Finding 2: Tool-calling strategies diverge across providers.
| Strategy | Models | Avg Turns | Accuracy Range |
|---|---|---|---|
| Few precise queries + deep thinking | Fable 5, Opus 4.7/4.8 | 5-7 | 0.44-0.55 |
| Massive single-query (“god-query”) | GPT-5.5, GPT-5.4 | 2-5 | 0.33-0.46 |
| Iterative multi-step exploration | Gemini 3.5 Flash, Opus 4.6 | 16-19 | 0.44-0.46 |
The “god-query” strategy refers to generating a single comprehensive SQL query covering all data dimensions in one tool call, rather than querying iteratively. This achieves fast completion (2-5 turns) but misses subtle multi-step problems that require progressive hypothesis refinement.
Finding 3: pass@1 vs pass@5 captures consistency vs ceiling. Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676) — demonstrating both high consistency and high ceiling. Claude Opus 4.7 at max effort (0.505 pass@1, 0.620 pass@5) is the second-strongest configuration. Notably, Gemini 3.5 Flash at only medium thinking effort (0.439 pass@1, 0.562 pass@5) remains competitive despite not being tested at its maximum reasoning level.
5. GLM5.2-744B-A40B vs Frontier Closed Models
Having established the frontier performance ceiling (Section 4), we now position GLM5.2-744B-A40B — a self-hosted open-source model with 40B active parameters — against these systems.
5.1 pass@1: Average Accuracy
5.1.1 Results
Leaderboard ranked by pass@1 accuracy:
| # | Model | Type | Active Params | pass@1 | Turns |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 | Closed | Unknown | 0.553 | 5.3 |
| 2 | Claude Opus 4.7 | Closed | Unknown | 0.505 | 7.4 |
| 3 | GPT-5.5 | Closed | Unknown | 0.459 | 5.1 |
| 4 | Claude Opus 4.6 | Closed | Unknown | 0.457 | 19.4 |
| 5 | GLM5.2-744B-A40B | Open | 40B | 0.456 | 23.1 |
5.1.2 Analysis
- GLM5.2-744B-A40B ranks 5th on pass@1 (0.456), essentially tied with Claude Opus 4.6 (0.457) and Opus 4.8 (0.455). The difference between positions 4-6 is within 0.2 percentage points — statistically indistinguishable.
- The gap to the pass@1 leader (Fable 5) is 9.7 points (0.456 vs 0.553). GLM5.2 achieves 82.5% of Fable 5’s average accuracy.
- GLM5.2 outperforms Gemini 3.5 Flash (0.456 vs 0.439) despite Gemini being a frontier closed-source model — noting that Gemini was tested at only medium thinking effort, not its maximum.
- GLM5.2 uses 23.1 turns — the highest among all models — indicating a thorough but higher-variance investigation strategy. This explains why pass@1 (which penalizes inconsistency) is relatively lower than pass@5.
- Compared to the next open-source model (Qwen3.5-397B-A17B at 0.356), GLM5.2 provides a +10 point improvement, consistent with the scaling behavior observed in Section 3.
5.2 pass@5: Peak Performance (Best-of-5)
5.2.1 Results
Leaderboard ranked by pass@5 accuracy:
| # | Model | Type | Active Params | pass@5 | Turns |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 | Closed | Unknown | 0.676 | 5.4 |
| 2 | Claude Opus 4.7 | Closed | Unknown | 0.620 | 7.1 |
| 3 | GLM5.2-744B-A40B | Open | 40B | 0.592 | 22.1 |
| 4 | Claude Opus 4.6 | Closed | Unknown | 0.576 | 19.2 |
| 5 | GPT-5.5 | Closed | Unknown | 0.570 | 4.0 |
5.2.2 Analysis
- GLM5.2-744B-A40B rises to 3rd on pass@5 (0.592), surpassing Claude Opus 4.6 (0.576), Opus 4.8 (0.540), GPT-5.5 (0.570), and Gemini 3.5 Flash (0.562). It is only behind Fable 5 and Opus 4.7.
- The gap to the pass@5 leader (Fable 5) narrows to 8.4 points (0.592 vs 0.676) — GLM5.2 achieves 87.6% of Fable 5’s peak performance, compared to 82.5% on pass@1. This narrowing indicates that GLM5.2’s iterative strategy produces correct answers at comparable quality, but with less consistency across samples.
- On pass@5, GLM5.2 (0.592) outperforms several frontier closed models including Gemini 3.5 Flash (0.562, +3.0 pts), Claude Opus 4.8 (0.540, +5.2 pts), and GPT-5.5 (0.570, +2.2 pts) — demonstrating that when GLM5.2’s exploration succeeds, it surpasses models with potentially much larger active parameter counts.
- Claude Fable 5 leads pass@5 at 0.676 — the highest among all models — demonstrating both the best consistency (pass@1 leader) and the best ceiling (pass@5 leader).
5.3 Combined Analysis: Consistency vs Peak Performance
Head-to-Head: GLM5.2-744B-A40B vs Claude Fable 5 (all 250 tasks, pass@5 per task):
| Outcome | Count | Percentage |
|---|---|---|
| Tied (both correct or both wrong) | 192 | 76.8% |
| Fable 5 wins | 35 | 14.0% |
| GLM5.2-744B-A40B wins | 23 | 9.2% |
- 4 tasks that GLM5.2-744B-A40B solves but Fable 5 cannot
- 16 tasks that Fable 5 solves but GLM5.2-744B-A40B cannot
- On 77% of tasks, both models reach the same outcome
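A head-to-head tally of this kind can be produced from per-task solved flags. The sketch below is illustrative only: `head_to_head` and the flag dictionaries are hypothetical, and "solved" is assumed to mean correct in at least one of the 5 samples.

```python
from collections import Counter
from typing import Dict


def head_to_head(a: Dict[str, bool], b: Dict[str, bool]) -> Counter:
    """Compare per-task solved flags for two models over their shared tasks."""
    tally = Counter()
    for task in a.keys() & b.keys():
        if a[task] == b[task]:
            tally["tied"] += 1        # both solved or both failed
        elif a[task]:
            tally["a_wins"] += 1      # only model A solved it
        else:
            tally["b_wins"] += 1      # only model B solved it
    return tally


# Toy flags (not benchmark data).
model_a = {"t1": True, "t2": False, "t3": True, "t4": False}
model_b = {"t1": True, "t2": True, "t3": False, "t4": False}
tally = head_to_head(model_a, model_b)
shares = {k: v / sum(tally.values()) for k, v in tally.items()}
```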
The consistency-ceiling trade-off:
| Dimension | GLM5.2-744B-A40B | Claude Fable 5 | Comparison |
|---|---|---|---|
| Active parameters | 40B | Unknown (est. ≫100B) | ~3-10× fewer |
| Avg turns | 23.1 | 5.3 | 4.4× more |
| pass@1 (consistency) | 0.456 | 0.553 | 82.5% of Fable 5 |
| pass@5 (ceiling) | 0.592 | 0.676 | 87.6% of Fable 5 |
| Format quality | 0.927 | 0.995 | 93.2% of Fable 5 |
The pass@1 → pass@5 improvement reveals a key insight: GLM5.2’s iterative 23-turn strategy introduces more variance — sometimes the exploration leads to the correct answer, sometimes it leads astray. Fable 5’s 5-turn precision strategy is both more consistent (higher pass@1) and achieves a higher ceiling (higher pass@5), demonstrating superior per-step reasoning. Nevertheless, GLM5.2 closes the relative gap from 82.5% (pass@1) to 87.6% (pass@5) when we take the best of 5 independent attempts.
Relative to three closed models (Fable 5, Opus 4.7, and Gemini 3.5 Flash):
| Metric | Fable 5 | Opus 4.7 | Gemini 3.5 Flash | GLM5.2-744B-A40B |
|---|---|---|---|---|
| pass@1 | 0.553 | 0.505 | 0.439 | 0.456 |
| pass@5 | 0.676 | 0.620 | 0.562 | 0.592 |
| Turns | 5.3 | 7.4 | 16.1 | 23.1 |
GLM5.2 surpasses Gemini 3.5 Flash on both pass@1 (+1.7 pts) and pass@5 (+3.0 pts), despite Gemini being a frontier closed model. It is competitive with Opus 4.7 on pass@5 (0.592 vs 0.620, gap of 2.8 pts) while using 3× more turns.
The open-closed gap summary:
| Metric | Best Closed (Fable 5) | GLM5.2-744B-A40B | Gap |
|---|---|---|---|
| pass@1 | 0.553 | 0.456 | 9.7 pts |
| pass@5 | 0.676 | 0.592 | 8.4 pts |
The gap between the best open-source model and the best closed-source model is 9.7 points on pass@1 and 8.4 points on pass@5. While meaningful, this gap is narrower than one might expect given the accessibility and cost difference between frontier closed APIs and self-hosted open models with 40B active parameters.
6. Revisiting Training Recipes: What Benchmark Behavior Reveals
Beyond ranking models, the benchmark reveals behavioral patterns that carry direct implications for training. By cross-referencing tool-calling behavior from closed APIs with our own RL training experiments across three scales (35B, 122B, 397B), we identify actionable training recipe insights.
6.1 Turn-Count Behavior Across Training Approaches
A striking pattern emerges when comparing tool-calling behavior across different training methods and model families:
| Training Method / Model | Scale | Source | Avg Turns | Accuracy | Observation |
|---|---|---|---|---|---|
| Qwen3.5 base (no RL) | 35B | Ours (open) | 11.5 | 0.173 | Moderate exploration |
| Qwen3.5 base (no RL) | 397B | Ours (open) | 14.2 | 0.356 | Deeper exploration at scale |
| Qwen3.5 + GRPO/DAPO RL | 35B | Ours (RL-trained) | ~2-3 | ~0.50 | Collapsed to mega-action |
| Qwen3.5 + GRPO/DAPO RL | 122B-397B | Ours (RL-trained) | ~6-7 | ~0.49-0.51 | Reduced but not collapsed |
| GLM5.2 (PPO-trained) | 744B | Open (third-party) | 23.1 | 0.456 | Extensive exploration |
| GPT-5.5 (effort=high) | Unknown | Closed API | 4.7 | 0.448 | Few broad queries |
| Claude Fable 5 | Unknown | Closed API | 5.3 | 0.553 | Few precise queries |
6.2 Finding: PPO Encourages Exploration; GRPO/DAPO Encourages Compression
In our RL training experiments across Qwen3.5-35B, 122B, and 397B:
GRPO/DAPO/GSPO training consistently produces models that use fewer turns (2-7 depending on scale). The model learns a “mega-action” strategy — one comprehensive tool call that queries all relevant information simultaneously, followed by immediate answer generation.
PPO training (tested on 122B and 397B) produces models that use more turns compared to GRPO/DAPO at equivalent training steps. The model maintains a more exploratory, hypothesis-testing approach.
GLM5.2 uses PPO (as stated in their technical report) and exhibits the most turns (23.1) among all evaluated models — consistent with PPO encouraging exploration through its value-function-based advantage estimation. Because PPO’s critic assigns value to intermediate states, the model receives positive learning signal for exploratory tool calls even when they don’t directly appear in the shortest path to the answer. In contrast, GRPO only rewards the final trajectory outcome, incentivizing the model to find the most compressed solution path.
This reveals a fundamental trade-off in RL algorithm choice for agentic tasks:
| Algorithm | Turn Behavior | Mechanism | Trade-off |
|---|---|---|---|
| GRPO/DAPO | Fewer turns (compressed) | Trajectory-level reward → learns shortest-path solutions | Efficient but may miss subtle multi-step problems |
| PPO | More turns (exploratory) | Step-level value function → intermediate exploration gets credit | Thorough but computationally expensive |
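The credit-assignment difference in the table can be illustrated with the two advantage computations. This is a simplified sketch with toy numbers, not the training code behind the experiments above: group-normalized outcome rewards stand in for GRPO-style methods, and generalized advantage estimation (GAE) with a hypothetical critic stands in for PPO.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """One scalar per trajectory: final reward normalized within its sampled
    group. Every step of a trajectory shares this single value."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """One value per step, from a critic's state-value estimates (GAE).
    The value after the last step is taken as 0."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy numbers (illustrative only): four sampled trajectories, two correct.
group = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# One 3-step trajectory with reward only at the end; the hypothetical critic's
# estimate rises after the first (exploratory) tool call.
per_step = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.6, 0.9])
```

In the toy trajectory, the first tool call receives a positive advantage of its own because the critic's estimate rises after it, whereas the group-normalized scheme gives each trajectory a single number that depends only on the final outcome.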
6.3 Finding: Scale Determines RL Stability and Catastrophic Forgetting
Our RL training experiments reveal a critical scale-dependent phenomenon:
At 35B scale (Qwen3.5-35B-A3B):
- RL training (GRPO/DAPO) achieves strong benchmark accuracy (~0.50 pass@1)
- Model collapses to ~2 turns (single mega-action strategy)
- Severe catastrophic forgetting: reasoning, math, and instruction-following capabilities degrade significantly on general benchmarks
- The model “over-specializes” — gaining benchmark performance at the expense of general capability
At 122B and 397B scale:
- Similar turn-reduction occurs (base 14-17 turns → RL-trained 6-7 turns) but less extreme
- Catastrophic forgetting is dramatically reduced — general capabilities show only ~2% average degradation across held-out benchmarks
- The larger model has sufficient capacity to maintain general knowledge while learning specialized agentic behavior
- PPO training is more stable at these scales; at 35B, PPO tends to collapse or learn very slowly
Implication: Small models face a capacity bottleneck where RL training forces a trade-off between task performance and general capability. Larger models (122B+) can accommodate both without severe degradation — a critical finding for practitioners choosing model scale for RL post-training.
6.4 Connecting Closed-Model Behavior to Training Insights
The closed-model observations corroborate our training findings:
| Closed Model | Turns | Likely Training Pattern | Evidence |
|---|---|---|---|
| GPT-5.4 | 2.4 | Aggressive efficiency optimization | Single mega-query, minimal exploration — mirrors 35B GRPO behavior |
| GPT-5.5 | 4.7 | Moderate efficiency + reasoning | Few broad queries with reasoning tokens |
| Claude Fable 5 | 5.3 | Efficiency + deep per-step reasoning | Precise queries, strong internal thinking |
| Gemini 3.5 Flash | 16.1 | Exploration-encouraging (PPO-like) | Iterative hypothesis testing |
| GLM5.2 | 23.1 | PPO (confirmed in technical report) | Maximum exploration |
GPT-5.4’s 2-turn mega-action behavior mirrors what we observe when RL-training 35B models with GRPO — suggesting aggressive efficiency optimization that limits multi-step reasoning. Conversely, GLM5.2’s extensive exploration and Gemini’s iterative approach suggest training methods that reward intermediate investigation rather than just final accuracy.
6.5 Recommendations for Training Recipe Design
Based on the combined evidence:
Algorithm selection determines exploration behavior. For tasks requiring thorough multi-step investigation, PPO’s exploration-encouraging property produces agents that investigate more deeply. For latency-sensitive deployment, GRPO/DAPO produces more streamlined agents — but verify that general capabilities are preserved.
Scale determines safe RL regime. At the 35B scale (3B active parameters), aggressive RL risks severe catastrophic forgetting; conservative learning rates or capability-preservation regularization are necessary. At 122B+, models tolerate RL training with minimal degradation (~2%).
Turn count is a training diagnostic. Monitoring average turn count during RL training provides an early signal: sudden collapse to 2-3 turns indicates the model is learning shortcuts rather than genuine multi-step reasoning. This collapse correlates with catastrophic forgetting at small scale.
Step limits must match model strategy. GLM5.2 gained 8.6 points by increasing from 25 to 50 steps. Evaluation frameworks should accommodate diverse investigation strategies rather than imposing uniform step budgets that penalize thorough exploration.
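The turn-count diagnostic above could be monitored with a simple check over per-checkpoint averages. The function name, the 3-turn floor, and the 50% drop threshold are assumptions chosen for illustration, and the traces are made up.

```python
from typing import List, Optional


def detect_turn_collapse(avg_turns: List[float],
                         floor: float = 3.0,
                         drop_fraction: float = 0.5) -> Optional[int]:
    """Return the index of the first checkpoint whose average turn count is at
    or below `floor` and at least `drop_fraction` below the first checkpoint;
    return None if no checkpoint qualifies."""
    if not avg_turns:
        return None
    baseline = avg_turns[0]
    for i, turns in enumerate(avg_turns):
        if turns <= floor and turns <= baseline * (1 - drop_fraction):
            return i
    return None


# Made-up traces of average turns per evaluation checkpoint.
collapsing = [11.5, 10.8, 9.1, 6.0, 2.7, 2.1]
stable = [14.2, 12.0, 7.1, 6.5]
collapsed_at = detect_turn_collapse(collapsing)
stable_result = detect_turn_collapse(stable)
```

A flagged checkpoint is a prompt to run held-out general-capability evaluations before continuing training, not proof of forgetting on its own.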
7. Conclusion
We presented a systematic evaluation of 12 model configurations on a self-created vertical-domain agentic multi-turn coding benchmark with verifiable results.
On benchmark design:
- The benchmark is valid and discriminative, showing consistent scaling from 3B to 40B active parameters across two model families without floor or ceiling effects.
- Private benchmarks with verifiable ground-truth answers provide uncontaminated evaluation signals that complement public benchmarks susceptible to leakage.
On model capability:
- Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676) through superior per-step reasoning in just 5 turns — the highest consistency and highest ceiling simultaneously.
- GLM5.2-744B-A40B ranks 5th on pass@1 and 3rd on pass@5 (0.456 / 0.592), surpassing Gemini 3.5 Flash, GPT-5.5, and Claude Opus 4.8 on pass@5. It reaches the same outcome as Fable 5 on 77% of tasks, with gaps of 9.7 points (pass@1) and 8.4 points (pass@5).
On training recipes:
- PPO encourages exploration (more turns); GRPO/DAPO encourages compression (fewer turns). Both achieve competitive accuracy through fundamentally different strategies.
- Catastrophic forgetting under RL is scale-dependent — severe at 35B, manageable (~2%) at 122B+.
- Turn-count collapse correlates with over-specialization — a diagnostic signal practitioners should monitor during RL training.
Future directions: Evaluating fine-tuned open models (SFT + RL at 122B–397B scale), testing Gemini 3.5 Flash at maximum thinking effort for a fairer comparison, and extending the benchmark to additional vertical domains.
Appendix: Model Configurations
A.1 Qwen3.5 Configuration
All Qwen3.5 models (35B, 122B, 397B) are evaluated using the recommended thinking-mode settings:
| Parameter | Value |
|---|---|
| Inference engine | vLLM |
| Precision | BF16 |
| temperature | 0.6 |
| top_p | 0.95 |
| top_k | 20 |
| min_p | 0.0 |
| presence_penalty | 0.0 |
| repetition_penalty | 1.0 |
| Max output tokens | 16,384 |
| Max steps (turns) | 50 |
| Samples per task | 5 |
| Thinking mode | Enabled (Qwen3.5 native thinking) |
A.2 GLM5.2-744B-A40B Configuration
GLM5.2 is served via SGLang on a 4-node cluster with Expert Parallelism:
| Parameter | Value |
|---|---|
| Inference engine | SGLang 0.5.9 |
| Precision | BF16 |
| GPUs | 4× p5e nodes (32× H200 total) |
| Parallelism | EP=32 (Expert Parallel) |
| Context window | 474K tokens (per worker) |
| CUDA graphs | Enabled |
| Speculative decoding | Not used (EAGLE attempted but unstable for MoE) |
| temperature | 0.3 |
| top_p | 0.95 |
| top_k | Disabled (-1) |
| frequency_penalty | 0.0 |
| presence_penalty | 0.0 |
| Max output tokens | 16,000 |
| Max steps (turns) | 50 |
| Samples per task | 5 |
| Thinking mode | Enabled (GLM5.2 native reasoning_content) |
Note on GLM5.2 thinking mode: GLM5.2 produces structured responses with separate reasoning_content (internal chain-of-thought) and content (final answer) fields. The thinking process is always active and cannot be disabled — similar to Claude Fable 5’s always-on adaptive thinking.
Note on step limit: GLM5.2 requires a 50-step limit (vs 25 for Qwen3.5) because its thorough investigation style averages 23 turns. At 25 steps, 28% of samples hit step-exhaustion and pass@1 is 0.370; raising the limit to 50 steps lifts pass@1 by 8.6 points, to 0.456.
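The step-exhaustion rate mentioned in this note can be computed from episode logs. The sketch assumes a hypothetical log format of `(turns_used, submitted)` pairs and uses toy episodes rather than benchmark data.

```python
from typing import List, Tuple


def step_exhaustion_rate(episodes: List[Tuple[int, bool]], max_steps: int) -> float:
    """Fraction of episodes that used all `max_steps` turns without submitting.
    Each episode is a (turns_used, submitted) pair."""
    exhausted = sum(1 for turns, submitted in episodes
                    if turns >= max_steps and not submitted)
    return exhausted / len(episodes)


# Toy episodes (not benchmark data), scored against a 25-step limit.
episodes = [(12, True), (25, False), (25, True), (19, True), (25, False)]
rate = step_exhaustion_rate(episodes, max_steps=25)   # 2 of 5 episodes
```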
