How Good Are GLM5.2, Claude Fable 5, and Others: Create Your Own Benchmark and Revisit Your Training Recipe

Update #2: Upon GLM5.2’s release, we self-hosted the BF16 version on a 4-node H200 cluster for benchmark evaluation. Configuration details are in the Appendix.
Update #1: Upon Claude Fable 5’s release, we evaluated it immediately at the default high reasoning effort via the Claude API. After completing all 250×5 pass@5 episodes, model access became unavailable — we were unable to test higher reasoning efforts (xhigh, max) for this model.

1. Summary

We present a systematic evaluation of 12 model configurations on a self-created vertical-domain agentic multi-turn coding benchmark. The benchmark is fully self-owned with verifiable ground-truth answers, testing LLMs’ ability to solve complex problems through iterative tool use (Python, Bash, SQL, local-host KB search, task submission) within self-contained Docker environments. A ReAct-based agent loop serves as the evaluation harness (consistent with our TRAJECT-Bench framework), producing pass@1 and pass@5 accuracy scores for rigorous cross-model comparison.

Our evaluation yields four main findings:

Benchmark validation. We evaluate four models spanning 35B to 744B total parameters (Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, GLM5.2-744B-A40B). Accuracy scales monotonically from 0.173 to 0.456 with increasing active parameters, confirming the benchmark’s discriminative power without floor or ceiling effects.
Closed-source frontier evaluation. Testing seven frontier models from Anthropic (Claude Fable 5, Opus 4.6/4.7/4.8), OpenAI (GPT-5.5, GPT-5.4), and Google (Gemini 3.5 Flash) at multiple reasoning effort levels, we find that Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676). Reasoning effort scaling is model-specific: some models gain about 5 points at higher effort while others degrade.
GLM5.2-744B-A40B positioning. With 0.456 pass@1 and 0.592 pass@5, GLM5.2 ranks 5th on pass@1 and 3rd on pass@5. On pass@5 it surpasses Gemini 3.5 Flash, GPT-5.5, and Claude Opus 4.8; on pass@1 it is ahead of Gemini 3.5 Flash and Opus 4.8 but trails GPT-5.5 (0.459). In head-to-head comparison with Claude Fable 5, it reaches the same outcome on 77% of tasks, with gaps of 9.7 points (pass@1) and 8.4 points (pass@5).
Training recipe insights. Cross-referencing closed-model behavior with our own RL experiments reveals: (a) PPO-trained models exhibit more exploratory multi-turn behavior than GRPO/DAPO-trained models, consistent with GLM5.2’s pattern; (b) small models (35B) suffer severe catastrophic forgetting under aggressive RL despite strong task accuracy; (c) larger models (122B+) tolerate RL with only ~2% general capability degradation.

2. Introduction

2.1 Motivation: Why Create Your Own Benchmark?

Public benchmarks face two well-documented challenges: data contamination (models may have seen test data during pre-training) and benchmark-maxing (developers optimize specifically for known benchmarks rather than general capability). Our prior work, TRAJECT-Bench (ICLR 2025), introduced trajectory-aware evaluation of LLM tool use with fine-grained diagnostics — tool selection correctness, argument accuracy, and dependency satisfaction. As a published benchmark, however, it is susceptible to leakage over time.

To obtain reliable, uncontaminated evaluation signals, we built a private vertical-domain agentic multi-turn coding benchmark with the following properties:

Fully private and self-owned — never published, shared externally, or included in any training corpus, eliminating contamination risk entirely.
Verifiable results — each task has a definitive ground-truth answer, enabling objective pass@1 and pass@5 accuracy measurement without reliance on model-based judges.
Multi-turn agentic coding — models must autonomously decide which tools to invoke (Python, Bash, SQL, local-host KB search), write and execute code within Docker environments, interpret intermediate results, and iteratively build toward a solution through multiple rounds of tool use.
Multi-domain grounding — tasks span three vertical domains (code migration, data analytics, financial QA) with realistic complexity that requires genuine multi-step reasoning.
Comprehensive metrics — four complementary evaluation signals: accuracy, format compliance, CoT reasoning quality, and investigation efficiency.

2.2 Benchmark Overview

Task domains. The benchmark spans three vertical domains, each contributing 100 candidate tasks (300 total):

Code migration — translating code between programming languages and verifying output equivalence
Data analytics — multi-step numerical reasoning over structured databases requiring query composition and interpretation
Financial QA — quantitative reasoning over public companies’ latest earnings reports, involving multi-step calculations (ratios, growth rates, aggregations)

After manual quality review to ensure unambiguous ground-truth answers and appropriate difficulty calibration, 50 tasks were removed, yielding 250 tasks for downstream evaluation.

Task environment. Each task is instantiated in a self-contained Docker environment containing the necessary data, tables, and initial code artifacts. The model interacts with this environment through a set of tools:

Python execution — write and run analytical code
Bash — execute shell commands within the Docker container
SQL query — retrieve structured data from knowledge bases
Search — query a local-host knowledge-base search engine for relevant information
Submit — signal task completion with the final answer

Agent harness. The evaluation employs a ReAct-based agent loop (Reasoning + Acting) — the same framework used in our TRAJECT-Bench (ICLR 2025) paper. At each step, the model observes the current state (previous tool outputs), reasons about what information is still needed, and selects the next tool action. The loop continues until the model submits a final answer or reaches the maximum step limit.
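The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness used in the evaluation: `run_episode`, `Action`, the `policy` callable, and the tool registry are hypothetical names, the toy policy stands in for an LLM call, and counting the submit action as a turn is our assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    tool: str      # e.g. "python", "bash", "sql", "search", "submit"
    argument: str  # code, command, query, or the final answer


# Hypothetical types: a policy maps the observation history to the next action.
Policy = Callable[[List[str]], Action]
Tool = Callable[[str], str]


def run_episode(policy: Policy, tools: Dict[str, Tool],
                max_steps: int) -> Tuple[Optional[str], int]:
    """Run one ReAct-style episode; return (submitted answer or None, turns used)."""
    history: List[str] = []
    for step in range(1, max_steps + 1):
        action = policy(history)                  # reason over prior tool outputs
        if action.tool == "submit":               # model signals completion
            return action.argument, step
        observation = tools[action.tool](action.argument)   # act
        history.append(f"{action.tool}: {observation}")     # observe
    return None, max_steps                        # step limit reached, no submission


# Toy stand-ins for the model and the SQL tool (illustrative only).
def toy_policy(history: List[str]) -> Action:
    if not history:
        return Action("sql", "SELECT revenue FROM t")
    return Action("submit", history[-1].split(": ", 1)[1])


toy_tools: Dict[str, Tool] = {"sql": lambda query: "42"}

answer, turns = run_episode(toy_policy, toy_tools, max_steps=50)
print(answer, turns)
```

An episode that never submits returns `None` together with the step limit, which is how step exhaustion would surface in a harness of this shape.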

Evaluation metrics:

Accuracy (pass@1, pass@5): Did the model arrive at the correct answer? Measured as exact match against verifiable ground truth. pass@1 is the mean accuracy over the 5 independent samples per task; pass@5 counts a task as solved if at least one of the 5 samples is correct, and so measures the model’s ceiling (best-of-5).
Format quality: Does the response follow structured output requirements across multi-turn episodes?
CoT reasoning quality: Are the model’s cited evidence points and reasoning chains relevant to the ground-truth answer?
Average turns: Number of tool-call rounds used (measures investigation depth and efficiency).

Scale: 250 tasks × 5 independent samples per task (1,250 episodes total).
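The two accuracy metrics can be computed from per-sample outcomes as in the sketch below. The function names, the toy data, and the whitespace-stripping in `exact_match` are assumptions for illustration; the actual answer normalization used by the benchmark is not specified here.

```python
from typing import Dict, List


def exact_match(prediction: str, truth: str) -> bool:
    """Assumed normalization: ignore surrounding whitespace only."""
    return prediction.strip() == truth.strip()


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean per-sample accuracy, averaged over tasks."""
    return sum(sum(samples) / len(samples) for samples in results.values()) / len(results)


def pass_at_5(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved by at least one of the samples (best-of-5)."""
    return sum(any(samples) for samples in results.values()) / len(results)


# Toy outcomes: 4 tasks x 5 samples (the benchmark itself has 250 tasks).
results = {
    "task_1": [True, True, True, True, True],
    "task_2": [True, False, False, False, False],
    "task_3": [False, False, False, False, False],
    "task_4": [True, True, False, False, False],
}

print(round(pass_at_1(results), 3), round(pass_at_5(results), 3))
```

In the toy data, `task_2` contributes only 0.2 to pass@1 but counts fully toward pass@5, which is the consistency-versus-ceiling distinction used throughout the analysis.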

2.3 Connection to TRAJECT-Bench (ICLR 2025)

TRAJECT-Bench (ICLR 2025) evaluated trajectory-level tool-use quality — whether tools are selected, parameterized, and ordered correctly — in controlled settings. This benchmark extends that philosophy to end-to-end problem solving: models must orchestrate sustained multi-turn tool use to solve complex problems and produce verifiable final answers. The key advance is verifiability — every task has a deterministic correct answer derivable from the data, enabling objective accuracy measurement without LLM-as-judge evaluation.

3. Benchmark Validation: Scaling Behavior

A well-designed benchmark should produce scores that scale predictably with model capability. We validate this by evaluating four models with increasing active parameters: Qwen3.5-35B-A3B (3B active), Qwen3.5-122B-A10B (10B), Qwen3.5-397B-A17B (17B), and GLM5.2-744B-A40B (40B). Qwen3.5 models are evaluated as base checkpoints without fine-tuning; GLM5.2 is evaluated with its native thinking mode enabled.

3.1 Results

pass@1:

Model	Total Params	Active Params	Accuracy	Format	CoT Quality	Turns
GLM5.2-744B-A40B	744B	40B	0.456	0.927	0.510	23.1
Qwen3.5-397B-A17B	397B	17B	0.356	0.686	0.450	14.2
Qwen3.5-122B-A10B	122B	10B	0.239	0.479	0.313	17.2
Qwen3.5-35B-A3B	35B	3B	0.173	0.425	0.296	11.5

pass@5:

Model	Total Params	Active Params	Accuracy	Format	CoT Quality	Turns
GLM5.2-744B-A40B	744B	40B	0.592	0.967	0.532	22.1
Qwen3.5-397B-A17B	397B	17B	0.564	0.831	0.545	14.1
Qwen3.5-122B-A10B	122B	10B	0.452	0.751	0.493	17.0
Qwen3.5-35B-A3B	35B	3B	0.376	0.703	0.492	10.8

3.2 Analysis

Monotonic scaling confirms validity. Accuracy improves consistently with active parameters: 0.173 (3B) → 0.239 (10B) → 0.356 (17B) → 0.456 (40B), with gains of +6.6, +11.7, and +10.0 points at successive model sizes. There is no floor effect (the smallest model scores meaningfully above zero) and no ceiling effect (the largest model is well below 1.0), confirming discrimination across a wide capability range.
Cross-family generalization. The scaling holds across two different model families (Qwen3.5 and GLM5.2), indicating the benchmark measures general agentic coding capability rather than family-specific artifacts.
Format quality correlates with scale. Structured output compliance improves from 0.425 to 0.927 — larger models better maintain instruction-following across long multi-turn episodes, a critical requirement for agentic deployment.
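Investigation depth tends to increase with capability. GLM5.2-744B-A40B uses the most turns (23.1) among all evaluated models, and the smallest Qwen3.5 model uses the fewest among the open models (11.5), suggesting that more capable open models explore more deeply rather than guessing early. The trend is not strictly monotonic, however: Qwen3.5-122B-A10B averages 17.2 turns versus 14.2 for Qwen3.5-397B-A17B. The benchmark appears to reward this thoroughness.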
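The scaling trend can be checked directly from the pass@1 table in Section 3.1. The sketch below computes the gain at each step and, as one possible normalization that is our assumption rather than part of the original analysis, the gain per doubling of active parameters.

```python
import math
from typing import List, Tuple

# (active parameters in billions, pass@1 accuracy) from the Section 3.1 table.
points: List[Tuple[float, float]] = [(3, 0.173), (10, 0.239), (17, 0.356), (40, 0.456)]


def step_gains(points: List[Tuple[float, float]]) -> List[Tuple[float, float, float, float]]:
    """Return (from_B, to_B, gain in points, gain in points per doubling) per step."""
    rows = []
    for (p0, a0), (p1, a1) in zip(points, points[1:]):
        gain_pts = (a1 - a0) * 100
        doublings = math.log2(p1 / p0)   # how many 2x steps separate the two models
        rows.append((p0, p1, gain_pts, gain_pts / doublings))
    return rows


for p0, p1, gain, per_doubling in step_gains(points):
    print(f"{p0}B -> {p1}B: +{gain:.1f} pts ({per_doubling:.1f} pts per doubling)")
```

Because the steps differ in size (roughly 3.3×, 1.7×, and 2.4×), the per-doubling figures vary more than the raw gains do, so the raw step gains are the safer summary.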

4. Closed-Source Frontier Model Evaluation

With the benchmark validated, we turn to frontier closed-source models. We evaluate seven models across three providers (Anthropic, OpenAI, Google), systematically testing multiple reasoning effort configurations where the API supports it.

4.1 Results

We present the top 10 closed-source model configurations ranked by accuracy. Multiple configurations of the same model are included where we tested different reasoning effort levels.

pass@1:

#	Model	Effort	Accuracy	Format	CoT Quality	Turns
1	Claude Fable 5	high	0.553	0.995	0.639	5.3
2	Claude Opus 4.7	max	0.505	0.997	0.638	7.4
3	Claude Opus 4.7	high (default)	0.466	0.999	0.634	5.6
4	GPT-5.5	xhigh	0.459	—	—	5.1
5	Claude Opus 4.6	max	0.457	—	—	19.4
6	Claude Opus 4.8	xhigh	0.455	—	—	5.8
7	GPT-5.5	high	0.448	—	—	4.7
8	Claude Opus 4.6	high	0.444	—	—	19.6
9	Claude Opus 4.8	high (default)	0.440	0.999	0.692	4.6
10	Gemini 3.5 Flash	medium	0.439	0.884	0.646	16.1

pass@5:

#	Model	Effort	Accuracy	Format	CoT Quality	Turns
1	Claude Fable 5	high	0.676	0.999	0.643	5.4
2	Claude Opus 4.7	max	0.620	0.999	0.638	7.1
3	Claude Opus 4.6	max	0.576	—	—	19.2
4	GPT-5.5	xhigh	0.570	—	—	4.0
5	GPT-5.5	high	0.564	—	—	3.8
6	Gemini 3.5 Flash	medium	0.562	0.963	0.711	15.0
7	Claude Opus 4.7	high (default)	0.556	0.999	0.637	5.8
8	Claude Opus 4.8	high (default)	0.556	0.999	0.689	4.8
9	Claude Opus 4.6	high	0.544	—	—	19.5
10	Claude Opus 4.8	xhigh	0.540	—	—	5.9

4.2 Analysis

Finding 1: Reasoning effort scaling is model-specific.

Model	Default → Best	Δ Accuracy	Optimal Effort	Over-effort Hurts?
GPT-5.5	0.406 → 0.459	+5.3	xhigh	No
Claude Opus 4.7	0.466 → 0.505	+3.9	max	No
Claude Opus 4.6	0.438 → 0.457	+1.9	max	No
Claude Opus 4.8	0.440 → 0.455	+1.5	xhigh	Yes (max: −2.1)
GPT-5.4	0.326 → 0.331	+0.5	high	Yes (xhigh: timeout)

Models that appear to be built for deep reasoning (Opus 4.7, GPT-5.5) benefit most from elevated effort. Other models show diminishing or negative returns: “more thinking” is not universally better and can hurt performance, possibly through over-analysis.

Finding 2: Tool-calling strategies diverge across providers.

Strategy	Models	Avg Turns	Accuracy Range
Few precise queries + deep thinking	Fable 5, Opus 4.7/4.8	5-7	0.44-0.55
Massive single-query (“god-query”)	GPT-5.5, GPT-5.4	2-5	0.33-0.46
Iterative multi-step exploration	Gemini 3.5 Flash, Opus 4.6	16-19	0.44-0.46

The “god-query” strategy refers to generating a single comprehensive SQL query covering all data dimensions in one tool call, rather than querying iteratively. This achieves fast completion (2-5 turns) but misses subtle multi-step problems that require progressive hypothesis refinement.

Finding 3: pass@1 vs pass@5 captures consistency vs ceiling. Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676) — demonstrating both high consistency and high ceiling. Claude Opus 4.7 at max effort (0.505 pass@1, 0.620 pass@5) is the second-strongest configuration. Notably, Gemini 3.5 Flash at only medium thinking effort (0.439 pass@1, 0.562 pass@5) remains competitive despite not being tested at its maximum reasoning level.

5. GLM5.2-744B-A40B vs Frontier Closed Models

Having established the frontier performance ceiling (Section 4), we now position GLM5.2-744B-A40B — a self-hosted open-source model with 40B active parameters — against these systems.

5.1 pass@1: Average Accuracy

5.1.1 Results

Leaderboard ranked by pass@1 accuracy:

#	Model	Type	Active Params	pass@1	Turns
1	Claude Fable 5	Closed	Unknown	0.553	5.3
2	Claude Opus 4.7	Closed	Unknown	0.505	7.4
3	GPT-5.5	Closed	Unknown	0.459	5.1
4	Claude Opus 4.6	Closed	Unknown	0.457	19.4
5	GLM5.2-744B-A40B	Open	40B	0.456	23.1

5.1.2 Analysis

GLM5.2-744B-A40B ranks 5th on pass@1 (0.456), essentially tied with Claude Opus 4.6 (0.457) and Opus 4.8 (0.455). Positions 4-6 are separated by only 0.2 percentage points, a difference too small to treat as meaningful.
The gap to the pass@1 leader (Fable 5) is 9.7 points (0.456 vs 0.553). GLM5.2 achieves 82.5% of Fable 5’s average accuracy.
GLM5.2 outperforms Gemini 3.5 Flash (0.456 vs 0.439) despite Gemini being a frontier closed-source model — noting that Gemini was tested at only medium thinking effort, not its maximum.
GLM5.2 uses 23.1 turns, the most of any model, indicating a thorough but higher-variance investigation strategy. This may explain why its pass@1 rank (5th), which penalizes inconsistency, trails its pass@5 rank (3rd).
Compared to the next open-source model (Qwen3.5-397B-A17B at 0.356), GLM5.2 provides a +10 point improvement, consistent with the scaling behavior observed in Section 3.

5.2 pass@5: Peak Performance (Best-of-5)

5.2.1 Results

Leaderboard ranked by pass@5 accuracy:

#	Model	Type	Active Params	pass@5	Turns
1	Claude Fable 5	Closed	Unknown	0.676	5.4
2	Claude Opus 4.7	Closed	Unknown	0.620	7.1
3	GLM5.2-744B-A40B	Open	40B	0.592	22.1
4	Claude Opus 4.6	Closed	Unknown	0.576	19.2
5	GPT-5.5	Closed	Unknown	0.570	4.0

5.2.2 Analysis

GLM5.2-744B-A40B rises to 3rd on pass@5 (0.592), surpassing Claude Opus 4.6 (0.576), Opus 4.8 (0.540), GPT-5.5 (0.570), and Gemini 3.5 Flash (0.562). It is only behind Fable 5 and Opus 4.7.
The gap to the pass@5 leader (Fable 5) narrows to 8.4 points (0.592 vs 0.676) — GLM5.2 achieves 87.6% of Fable 5’s peak performance, compared to 82.5% on pass@1. This narrowing indicates that GLM5.2’s iterative strategy produces correct answers at comparable quality, but with less consistency across samples.
On pass@5, GLM5.2 (0.592) outperforms several frontier closed models including Gemini 3.5 Flash (0.562, +3.0 pts), Claude Opus 4.8 (0.540, +5.2 pts), and GPT-5.5 (0.570, +2.2 pts) — demonstrating that when GLM5.2’s exploration succeeds, it surpasses models with potentially much larger active parameter counts.
Claude Fable 5 leads pass@5 at 0.676 — the highest among all models — demonstrating both the best consistency (pass@1 leader) and the best ceiling (pass@5 leader).

5.3 Combined Analysis: Consistency vs Peak Performance

Head-to-Head: GLM5.2-744B-A40B vs Claude Fable 5 (all 250 tasks, pass@5 per task):

Outcome	Count	Percentage
Tied (both correct or both wrong)	192	76.8%
Fable 5 wins	35	14.0%
GLM5.2-744B-A40B wins	23	9.2%

4 tasks that GLM5.2-744B-A40B solves but Fable 5 cannot
16 tasks that Fable 5 solves but GLM5.2-744B-A40B cannot
On 77% of tasks, both models reach the same outcome
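A per-task tally of this kind can be produced as in the sketch below. It assumes each model's result is reduced to one boolean per task (solved in at least one of five samples) and that a "win" means one model solves a task the other does not; the function name and toy data are illustrative. The last lines only re-derive the percentages from the counts in the table above.

```python
from collections import Counter
from typing import Dict


def head_to_head(a: Dict[str, bool], b: Dict[str, bool]) -> Counter:
    """Count tied tasks and tasks where exactly one model succeeds."""
    tally: Counter = Counter()
    for task in a.keys() & b.keys():     # compare only tasks both models attempted
        if a[task] == b[task]:
            tally["tied"] += 1
        elif a[task]:
            tally["a_wins"] += 1
        else:
            tally["b_wins"] += 1
    return tally


# Toy per-task outcomes (illustrative only).
model_a = {"t1": True, "t2": True, "t3": False, "t4": False}
model_b = {"t1": True, "t2": False, "t3": True, "t4": False}
tally = head_to_head(model_a, model_b)
print(dict(tally))

# Percentages implied by the reported counts over 250 tasks.
reported = {"tied": 192, "fable_wins": 35, "glm_wins": 23}
shares = {name: round(count / 250 * 100, 1) for name, count in reported.items()}
print(shares)
```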

The consistency-ceiling trade-off:

Dimension	GLM5.2-744B-A40B	Claude Fable 5	Comparison
Active parameters	40B	Unknown (est. ≫100B)	~3-10× fewer
Avg turns	23.1	5.3	4.4× more
pass@1 (consistency)	0.456	0.553	82.5% of Fable 5
pass@5 (ceiling)	0.592	0.676	87.6% of Fable 5
Format quality	0.927	0.995	93.2% of Fable 5

The pass@1 → pass@5 improvement reveals a key insight: GLM5.2’s iterative 23-turn strategy introduces more variance — sometimes the exploration leads to the correct answer, sometimes it leads astray. Fable 5’s 5-turn precision strategy is both more consistent (higher pass@1) and achieves a higher ceiling (higher pass@5), demonstrating superior per-step reasoning. Nevertheless, GLM5.2 closes the relative gap from 82.5% (pass@1) to 87.6% (pass@5) when we take the best of 5 independent attempts.

Relative to selected closed models:

Metric	Fable 5	Opus 4.7	Gemini 3.5 Flash	GLM5.2-744B-A40B
pass@1	0.553	0.505	0.439	0.456
pass@5	0.676	0.620	0.562	0.592
Turns	5.3	7.4	16.1	23.1

GLM5.2 surpasses Gemini 3.5 Flash on both pass@1 (+1.7 pts) and pass@5 (+3.0 pts), despite Gemini being a frontier closed model. It is competitive with Opus 4.7 on pass@5 (0.592 vs 0.620, gap of 2.8 pts) while using about 3× as many turns (23.1 vs 7.4).

The open-closed gap summary:

Metric	Best Closed (Fable 5)	GLM5.2-744B-A40B	Gap
pass@1	0.553	0.456	9.7 pts
pass@5	0.676	0.592	8.4 pts

The gap between the best open-source model and the best closed-source model is 9.7 points on pass@1 and 8.4 points on pass@5. While meaningful, this gap is narrower than one might expect given the accessibility and cost difference between frontier closed APIs and self-hosted open models with 40B active parameters.
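The gap and relative-performance figures quoted in this section follow from the reported scores by simple arithmetic. The helper names below are ours; the inputs are the pass@1 and pass@5 values from the tables above.

```python
def gap_points(best: float, other: float) -> float:
    """Absolute gap in percentage points, rounded to one decimal."""
    return round((best - other) * 100, 1)


def relative_percent(other: float, best: float) -> float:
    """Score of `other` as a percentage of `best`, rounded to one decimal."""
    return round(other / best * 100, 1)


fable = {"pass@1": 0.553, "pass@5": 0.676}
glm = {"pass@1": 0.456, "pass@5": 0.592}

for metric in ("pass@1", "pass@5"):
    print(metric,
          gap_points(fable[metric], glm[metric]),
          relative_percent(glm[metric], fable[metric]))
```

The absolute gap shrinks by 1.3 points between the two metrics while the relative score rises by about 5 percentage points, which is the narrowing discussed in Section 5.3.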

6. Revisiting Training Recipes: What Benchmark Behavior Reveals

Beyond ranking models, the benchmark reveals behavioral patterns that carry direct implications for training. By cross-referencing tool-calling behavior from closed APIs with our own RL training experiments across three scales (35B, 122B, 397B), we identify actionable training recipe insights.

6.1 Turn-Count Behavior Across Training Approaches

A striking pattern emerges when comparing tool-calling behavior across different training methods and model families:

Training Method / Model	Scale	Source	Avg Turns	Accuracy	Observation
Qwen3.5 base (no RL)	35B	Ours (open)	11.5	0.173	Moderate exploration
Qwen3.5 base (no RL)	397B	Ours (open)	14.2	0.356	Deeper exploration at scale
Qwen3.5 + GRPO/DAPO RL	35B	Ours (RL-trained)	~2-3	~0.50	Collapsed to mega-action
Qwen3.5 + GRPO/DAPO RL	122B-397B	Ours (RL-trained)	~6-7	~0.49-0.51	Reduced but not collapsed
GLM5.2 (PPO-trained)	744B	Open (third-party)	23.1	0.456	Extensive exploration
GPT-5.5 (effort=high)	Unknown	Closed API	4.7	0.448	Few broad queries
Claude Fable 5	Unknown	Closed API	5.3	0.553	Few precise queries

6.2 Finding: PPO Encourages Exploration; GRPO/DAPO Encourages Compression

In our RL training experiments across Qwen3.5-35B, 122B, and 397B:

GRPO/DAPO/GSPO training consistently produces models that use fewer turns (2-7 depending on scale). The model learns a “mega-action” strategy — one comprehensive tool call that queries all relevant information simultaneously, followed by immediate answer generation.
PPO training (tested on 122B and 397B) produces models that use more turns compared to GRPO/DAPO at equivalent training steps. The model maintains a more exploratory, hypothesis-testing approach.
GLM5.2 uses PPO (as stated in their technical report) and exhibits the most turns (23.1) among all evaluated models — consistent with PPO encouraging exploration through its value-function-based advantage estimation. Because PPO’s critic assigns value to intermediate states, the model receives positive learning signal for exploratory tool calls even when they don’t directly appear in the shortest path to the answer. In contrast, GRPO only rewards the final trajectory outcome, incentivizing the model to find the most compressed solution path.

This reveals a fundamental trade-off in RL algorithm choice for agentic tasks:

Algorithm	Turn Behavior	Mechanism	Trade-off
GRPO/DAPO	Fewer turns (compressed)	Trajectory-level reward → learns shortest-path solutions	Efficient but may miss subtle multi-step problems
PPO	More turns (exploratory)	Step-level value function → intermediate exploration gets credit	Thorough but computationally expensive
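The difference in credit assignment can be made concrete with a small sketch. It is a simplified illustration, not our training code: `grpo_advantages` standardizes outcome rewards within a group of rollouts (using the population standard deviation, which is an assumption; implementations vary), and `gae_advantages` is a plain implementation of Generalized Advantage Estimation as commonly used with PPO. The reward and value numbers are invented for the example.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: one scalar per trajectory, shared by all its steps."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Per-step advantages; `values` has len(rewards) + 1 entries (last = bootstrap)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error at step t
        running = delta + gamma * lam * running                  # discounted sum of TD errors
        advantages[t] = running
    return advantages


# Four rollouts of the same task, outcome reward only (1 = correct final answer).
group = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# One three-step rollout: reward arrives only at the end, but the critic's value
# estimate rises after each exploratory tool call (made-up numbers).
per_step = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.5, 0.8, 0.0])

print([round(a, 3) for a in group])
print([round(a, 4) for a in per_step])
```

In the GAE case each intermediate step receives its own advantage whenever the critic's estimate improves, whereas the group-relative scheme assigns the same value to every step of a trajectory. Whether this mechanism accounts for the turn-count differences we observe remains a hypothesis.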

6.3 Finding: Scale Determines RL Stability and Catastrophic Forgetting

Our RL training experiments reveal a critical scale-dependent phenomenon:

At 35B scale (Qwen3.5-35B-A3B):

RL training (GRPO/DAPO) achieves strong benchmark accuracy (~0.50 pass@1)
Model collapses to ~2 turns (single mega-action strategy)
Severe catastrophic forgetting: reasoning, math, and instruction-following capabilities degrade significantly on general benchmarks
The model “over-specializes” — gaining benchmark performance at the expense of general capability

At 122B and 397B scale:

Similar turn-reduction occurs (base 14-17 turns → RL-trained 6-7 turns) but less extreme
Catastrophic forgetting is dramatically reduced — general capabilities show only ~2% average degradation across held-out benchmarks
The larger model has sufficient capacity to maintain general knowledge while learning specialized agentic behavior
PPO training is more stable at these scales; at 35B, PPO tends to collapse or learn very slowly

Implication: Small models face a capacity bottleneck where RL training forces a trade-off between task performance and general capability. Larger models (122B+) can accommodate both without severe degradation — a critical finding for practitioners choosing model scale for RL post-training.

6.4 Connecting Closed-Model Behavior to Training Insights

The closed-model observations corroborate our training findings:

Closed Model	Turns	Likely Training Pattern	Evidence
GPT-5.4	2.4	Aggressive efficiency optimization	Single mega-query, minimal exploration — mirrors 35B GRPO behavior
GPT-5.5	4.7	Moderate efficiency + reasoning	Few broad queries with reasoning tokens
Claude Fable 5	5.3	Efficiency + deep per-step reasoning	Precise queries, strong internal thinking
Gemini 3.5 Flash	16.1	Exploration-encouraging (PPO-like)	Iterative hypothesis testing
GLM5.2	23.1	PPO (confirmed in technical report)	Maximum exploration

GPT-5.4’s 2-turn mega-action behavior mirrors what we observe when RL-training 35B models with GRPO — suggesting aggressive efficiency optimization that limits multi-step reasoning. Conversely, GLM5.2’s extensive exploration and Gemini’s iterative approach suggest training methods that reward intermediate investigation rather than just final accuracy.

6.5 Recommendations for Training Recipe Design

Based on the combined evidence:

Algorithm selection determines exploration behavior. For tasks requiring thorough multi-step investigation, PPO’s exploration-encouraging property produces agents that investigate more deeply. For latency-sensitive deployment, GRPO/DAPO produces more streamlined agents — but verify that general capabilities are preserved.
Scale determines safe RL regime. At ≤35B total parameters (3B active), aggressive RL risks severe catastrophic forgetting; conservative learning rates or capability-preservation regularization are necessary. At 122B+ total parameters, models tolerate RL training with minimal degradation (~2%).
Turn count is a training diagnostic. Monitoring average turn count during RL training provides an early signal: sudden collapse to 2-3 turns indicates the model is learning shortcuts rather than genuine multi-step reasoning. This collapse correlates with catastrophic forgetting at small scale.
Step limits must match model strategy. GLM5.2 gained 8.6 pass@1 points (0.370 → 0.456) when its step limit was raised from 25 to 50. Evaluation frameworks should accommodate diverse investigation strategies rather than imposing uniform step budgets that penalize thorough exploration.
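The turn-count diagnostic could be wired into a training loop roughly as follows. This is a sketch under assumptions: the class name, the window size, and the threshold of 3 turns (taken from the "2-3 turns" collapse described above) are illustrative choices, not values from our training setup.

```python
from collections import deque


class TurnCountMonitor:
    """Flags a possible collapse when the rolling mean turn count falls too low."""

    def __init__(self, window: int = 100, collapse_threshold: float = 3.0) -> None:
        self.turns: deque = deque(maxlen=window)
        self.collapse_threshold = collapse_threshold

    def rolling_mean(self) -> float:
        return sum(self.turns) / len(self.turns)

    def update(self, turns: int) -> bool:
        """Record one rollout's turn count; return True if collapse is suspected."""
        self.turns.append(turns)
        window_full = len(self.turns) == self.turns.maxlen
        return window_full and self.rolling_mean() <= self.collapse_threshold


# Toy trace: healthy exploration followed by a drop to 2-3 turns.
monitor = TurnCountMonitor(window=4, collapse_threshold=3.0)
alerts = [monitor.update(t) for t in [12, 11, 10, 9, 2, 2, 3, 2]]
print(alerts)
```

An alert of this kind is only a prompt to check held-out general benchmarks; a low turn count alone does not establish that forgetting has occurred.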

7. Conclusion

We presented a systematic evaluation of 12 model configurations on a self-created vertical-domain agentic multi-turn coding benchmark with verifiable results.

On benchmark design:

The benchmark is valid and discriminative, showing consistent scaling from 3B to 40B active parameters across two model families without floor or ceiling effects.
Private benchmarks with verifiable ground-truth answers provide uncontaminated evaluation signals that complement public benchmarks susceptible to leakage.

On model capability:

Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676) through superior per-step reasoning in just 5 turns — the highest consistency and highest ceiling simultaneously.
GLM5.2-744B-A40B ranks 5th on pass@1 and 3rd on pass@5 (0.456 / 0.592), surpassing Gemini 3.5 Flash, GPT-5.5, and Claude Opus 4.8 on pass@5. It reaches the same outcome as Fable 5 on 77% of tasks, with gaps of 9.7 points (pass@1) and 8.4 points (pass@5).

On training recipes:

PPO encourages exploration (more turns); GRPO/DAPO encourages compression (fewer turns). Both achieve competitive accuracy through fundamentally different strategies.
Catastrophic forgetting under RL is scale-dependent — severe at 35B, manageable (~2%) at 122B+.
Turn-count collapse correlates with over-specialization — a diagnostic signal practitioners should monitor during RL training.

Future directions: Evaluating fine-tuned open models (SFT + RL at 122B–397B scale), testing Gemini 3.5 Flash at maximum thinking effort for a fairer comparison, and extending the benchmark to additional vertical domains.

Appendix: Model Configurations

A.1 Qwen3.5 Configuration

All Qwen3.5 models (35B, 122B, 397B) are evaluated using the recommended thinking-mode settings:

Parameter	Value
Inference engine	vLLM
Precision	BF16
`temperature`	0.6
`top_p`	0.95
`top_k`	20
`min_p`	0.0
`presence_penalty`	0.0
`repetition_penalty`	1.0
Max output tokens	16,384
Max steps (turns)	50
Samples per task	5
Thinking mode	Enabled (Qwen3.5 native thinking)

A.2 GLM5.2-744B-A40B Configuration

GLM5.2 is served via SGLang on a 4-node cluster with Expert Parallelism:

Parameter	Value
Inference engine	SGLang 0.5.9
Precision	BF16
GPUs	4× p5e nodes (32× H200 total)
Parallelism	EP=32 (Expert Parallel)
Context window	474K tokens (per worker)
CUDA graphs	Enabled
Speculative decoding	Not used (EAGLE attempted but unstable for MoE)
`temperature`	0.3
`top_p`	0.95
`top_k`	Disabled (-1)
`frequency_penalty`	0.0
`presence_penalty`	0.0
Max output tokens	16,000
Max steps (turns)	50
Samples per task	5
Thinking mode	Enabled (GLM5.2 native `reasoning_content`)

Note on GLM5.2 thinking mode: GLM5.2 produces structured responses with separate `reasoning_content` (internal chain-of-thought) and `content` (final answer) fields. The thinking process is always active and cannot be disabled, similar to Claude Fable 5’s always-on adaptive thinking.

Note on step limit: GLM5.2 requires a 50-step limit (vs 25 for Qwen3.5) because its thorough investigation style averages 23 turns. At 25 steps, 28% of samples hit step-exhaustion and pass@1 falls by 8.6 points (0.370 at 25 steps vs 0.456 at 50 steps).
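A step-exhaustion rate like the 28% figure can be measured from episode logs as sketched below. The record format (turns used plus a submitted flag) and the function name are assumptions; the toy episodes are not benchmark data.

```python
from typing import List, Tuple


def exhaustion_rate(episodes: List[Tuple[int, bool]], max_steps: int) -> float:
    """Fraction of episodes that used every allowed step without submitting.

    Each episode is a (turns_used, submitted) pair.
    """
    exhausted = sum(1 for turns, submitted in episodes
                    if turns >= max_steps and not submitted)
    return exhausted / len(episodes)


# Toy log: two of five episodes run out of steps at a limit of 25.
toy_episodes = [(12, True), (25, False), (25, False), (25, True), (18, True)]
rate = exhaustion_rate(toy_episodes, max_steps=25)
print(rate)
```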

Bing He