How Good Are GLM5.2, Claude Fable 5, and Others

How Good Are GLM5.2, Claude Fable 5, and Others: Create Your Own Benchmark and Revisit Your Training Recipe

Update #2: Upon GLM5.2’s release, we self-hosted the BF16 version on a 4-node H200 cluster for benchmark evaluation. Configuration details are in the Appendix.

Update #1: Upon Claude Fable 5’s release, we evaluated it immediately at the default high reasoning effort via the Claude API. After completing all 250×5 pass@5 episodes, model access became unavailable — we were unable to test higher reasoning efforts (xhigh, max) for this model.

1. Summary

We present a systematic evaluation of 12 model configurations on a self-created vertical-domain agentic multi-turn coding benchmark. The benchmark is fully self-owned with verifiable ground-truth answers, testing LLMs’ ability to solve complex problems through iterative tool use (Python, Bash, SQL, local-host KB search, task submission) within self-contained Docker environments. A ReAct-based agent loop serves as the evaluation harness (consistent with our TRAJECT-Bench framework), producing pass@1 and pass@5 accuracy scores for rigorous cross-model comparison.

Our evaluation yields four main findings:

  1. Benchmark validation. We evaluate four models spanning 35B to 744B total parameters (Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, GLM5.2-744B-A40B). Accuracy scales monotonically from 0.173 to 0.456 with increasing active parameters, confirming the benchmark’s discriminative power without floor or ceiling effects.

  2. Closed-source frontier evaluation. Testing 6 frontier models from Anthropic (Claude Fable 5, Opus 4.6/4.7/4.8), OpenAI (GPT-5.5, GPT-5.4), and Google (Gemini 3.5 Flash) at multiple reasoning effort levels, we find that Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676). Reasoning effort scaling is model-specific — some models gain +5 points at maximum effort while others degrade.

  3. GLM5.2-744B-A40B positioning. With 0.456 pass@1 and 0.592 pass@5, GLM5.2 ranks 5th on pass@1 and 3rd on pass@5 — surpassing Gemini 3.5 Flash, GPT-5.5, and Claude Opus 4.8. In head-to-head comparison with Claude Fable 5, it matches on 77% of tasks, with gaps of 9.7 points (pass@1) and 8.4 points (pass@5).

  4. Training recipe insights. Cross-referencing closed-model behavior with our own RL experiments reveals: (a) PPO-trained models exhibit more exploratory multi-turn behavior than GRPO/DAPO-trained models, consistent with GLM5.2’s pattern; (b) small models (35B) suffer severe catastrophic forgetting under aggressive RL despite strong task accuracy; (c) larger models (122B+) tolerate RL with only ~2% general capability degradation.


2. Introduction

2.1 Motivation: Why Create Your Own Benchmark?

Public benchmarks face two well-documented challenges: data contamination (models may have seen test data during pre-training) and benchmark-maxing (developers optimize specifically for known benchmarks rather than general capability). Our prior work, TRAJECT-Bench (ICLR 2025), introduced trajectory-aware evaluation of LLM tool use with fine-grained diagnostics — tool selection correctness, argument accuracy, and dependency satisfaction. As a published benchmark, however, it is susceptible to leakage over time.

To obtain reliable, uncontaminated evaluation signals, we created a self-created vertical-domain agentic multi-turn coding benchmark with the following properties:

2.2 Benchmark Overview

Task domains. The benchmark spans three vertical domains, each contributing 100 candidate tasks (300 total):

After manual quality review to ensure unambiguous ground-truth answers and appropriate difficulty calibration, 50 tasks were removed, yielding 250 tasks for downstream evaluation.

Task environment. Each task is instantiated in a self-contained Docker environment containing the necessary data, tables, and initial code artifacts. The model interacts with this environment through a set of tools:

Agent harness. The evaluation employs a ReAct-based agent loop (Reasoning + Acting) — the same framework used in our TRAJECT-Bench (ICLR 2025) paper. At each step, the model observes the current state (previous tool outputs), reasons about what information is still needed, and selects the next tool action. The loop continues until the model submits a final answer or reaches the maximum step limit.

Evaluation metrics:

Scale: 250 tasks × 5 independent samples per task (1,250 episodes total).

2.3 Connection to TRAJECT-Bench (ICLR 2025)

TRAJECT-Bench (ICLR 2025) evaluated trajectory-level tool-use quality — whether tools are selected, parameterized, and ordered correctly — in controlled settings. This benchmark extends that philosophy to end-to-end problem solving: models must orchestrate sustained multi-turn tool use to solve complex problems and produce verifiable final answers. The key advance is verifiability — every task has a deterministic correct answer derivable from the data, enabling objective accuracy measurement without LLM-as-judge evaluation.


3. Benchmark Validation: Scaling Behavior

A well-designed benchmark should produce scores that scale predictably with model capability. We validate this by evaluating four models with increasing active parameters: Qwen3.5-35B-A3B (3B active), Qwen3.5-122B-A10B (10B), Qwen3.5-397B-A17B (17B), and GLM5.2-744B-A40B (40B). Qwen3.5 models are evaluated as base checkpoints without fine-tuning; GLM5.2 is evaluated with its native thinking mode enabled.

3.1 Results

pass@1:

ModelTotal ParamsActive ParamsAccuracyFormatCoT QualityTurns
GLM5.2-744B-A40B744B40B0.4560.9270.51023.1
Qwen3.5-397B-A17B397B17B0.3560.6860.45014.2
Qwen3.5-122B-A10B122B10B0.2390.4790.31317.2
Qwen3.5-35B-A3B35B3B0.1730.4250.29611.5

pass@5:

ModelTotal ParamsActive ParamsAccuracyFormatCoT QualityTurns
GLM5.2-744B-A40B744B40B0.5920.9670.53222.1
Qwen3.5-397B-A17B397B17B0.5640.8310.54514.1
Qwen3.5-122B-A10B122B10B0.4520.7510.49317.0
Qwen3.5-35B-A3B35B3B0.3760.7030.49210.8

3.2 Analysis

  1. Monotonic scaling confirms validity. Accuracy improves consistently with active parameters: 0.173 (3B) → 0.239 (10B) → 0.356 (17B) → 0.456 (40B), yielding approximately +8-12 points per 2× increase in active parameters. No floor effect (smallest model scores meaningfully) and no ceiling effect (largest model well below 1.0), confirming discrimination across a wide capability range.

  2. Cross-family generalization. The scaling holds across two different model families (Qwen3.5 and GLM5.2), indicating the benchmark measures general agentic coding capability rather than family-specific artifacts.

  3. Format quality correlates with scale. Structured output compliance improves from 0.425 to 0.927 — larger models better maintain instruction-following across long multi-turn episodes, a critical requirement for agentic deployment.

  4. Investigation depth scales with capability. GLM5.2-744B-A40B uses the most turns (23.1) among all evaluated models, suggesting that more capable models naturally explore more deeply rather than guessing early. The benchmark rewards this thoroughness.


4. Closed-Source Frontier Model Evaluation

With the benchmark validated, we turn to frontier closed-source models. We evaluate 6 models across three providers (Anthropic, OpenAI, Google), systematically testing multiple reasoning effort configurations where the API supports it.

4.1 Results

We present the top 10 closed-source model configurations ranked by accuracy. Multiple configurations of the same model are included where we tested different reasoning effort levels.

pass@1:

#ModelEffortAccuracyFormatCoT QualityTurns
1Claude Fable 5high0.5530.9950.6395.3
2Claude Opus 4.7max0.5050.9970.6387.4
3Claude Opus 4.7high (default)0.4660.9990.6345.6
4GPT-5.5xhigh0.4595.1
5Claude Opus 4.6max0.45719.4
6Claude Opus 4.8xhigh0.4555.8
7GPT-5.5high0.4484.7
8Claude Opus 4.6high0.44419.6
9Claude Opus 4.8high (default)0.4400.9990.6924.6
10Gemini 3.5 Flashmedium0.4390.8840.64616.1

pass@5:

#ModelEffortAccuracyFormatCoT QualityTurns
1Claude Fable 5high0.6760.9990.6435.4
2Claude Opus 4.7max0.6200.9990.6387.1
3Claude Opus 4.6max0.57619.2
4GPT-5.5xhigh0.5704.0
5GPT-5.5high0.5643.8
6Gemini 3.5 Flashmedium0.5620.9630.71115.0
7Claude Opus 4.7high (default)0.5560.9990.6375.8
8Claude Opus 4.8high (default)0.5560.9990.6894.8
9Claude Opus 4.6high0.54419.5
10Claude Opus 4.8xhigh0.5405.9

4.2 Analysis

Finding 1: Reasoning effort scaling is model-specific.

ModelDefault → BestΔ AccuracyOptimal EffortOver-effort Hurts?
GPT-5.50.406 → 0.459+5.3xhighNo
Claude Opus 4.70.466 → 0.505+3.9maxNo
Claude Opus 4.60.438 → 0.457+1.9maxNo
Claude Opus 4.80.440 → 0.455+1.5xhighYes (max: −2.1)
GPT-5.40.326 → 0.331+0.5highYes (xhigh: timeout)

Models architecturally designed for deep reasoning (Opus 4.7, GPT-5.5) benefit most from elevated effort. Models not optimized for extended reasoning show diminishing or negative returns — “more thinking” is not universally better and can hurt performance through over-analysis.

Finding 2: Tool-calling strategies diverge across providers.

StrategyModelsAvg TurnsAccuracy Range
Few precise queries + deep thinkingFable 5, Opus 4.7/4.85-70.44-0.55
Massive single-query (“god-query”)GPT-5.5, GPT-5.42-50.33-0.46
Iterative multi-step explorationGemini 3.5 Flash, Opus 4.616-190.44-0.46

The “god-query” strategy refers to generating a single comprehensive SQL query covering all data dimensions in one tool call, rather than querying iteratively. This achieves fast completion (2-5 turns) but misses subtle multi-step problems that require progressive hypothesis refinement.

Finding 3: pass@1 vs pass@5 captures consistency vs ceiling. Claude Fable 5 leads both pass@1 (0.553) and pass@5 (0.676) — demonstrating both high consistency and high ceiling. Claude Opus 4.7 at max effort (0.505 pass@1, 0.620 pass@5) is the second-strongest configuration. Notably, Gemini 3.5 Flash at only medium thinking effort (0.439 pass@1, 0.562 pass@5) remains competitive despite not being tested at its maximum reasoning level.


5. GLM5.2-744B-A40B vs Frontier Closed Models

Having established the frontier performance ceiling (Section 4), we now position GLM5.2-744B-A40B — a self-hosted open-source model with 40B active parameters — against these systems.

5.1 pass@1: Average Accuracy

5.1.1 Results

Leaderboard ranked by pass@1 accuracy:

#ModelTypeActive Paramspass@1Turns
1Claude Fable 5ClosedUnknown0.5535.3
2Claude Opus 4.7ClosedUnknown0.5057.4
3GPT-5.5ClosedUnknown0.4595.1
4Claude Opus 4.6ClosedUnknown0.45719.4
5GLM5.2-744B-A40BOpen40B0.45623.1

5.1.2 Analysis

5.2 pass@5: Peak Performance (Best-of-5)

5.2.1 Results

Leaderboard ranked by pass@5 accuracy:

#ModelTypeActive Paramspass@5Turns
1Claude Fable 5ClosedUnknown0.6765.4
2Claude Opus 4.7ClosedUnknown0.6207.1
3GLM5.2-744B-A40BOpen40B0.59222.1
4Claude Opus 4.6ClosedUnknown0.57619.2
5GPT-5.5ClosedUnknown0.5704.0

5.2.2 Analysis

5.3 Combined Analysis: Consistency vs Peak Performance

Head-to-Head: GLM5.2-744B-A40B vs Claude Fable 5 (all 250 tasks, pass@5 per task):

OutcomeCountPercentage
Tied (both correct or both wrong)19276.8%
Fable 5 wins3514.0%
GLM5.2-744B-A40B wins239.2%

The consistency-ceiling trade-off:

DimensionGLM5.2-744B-A40BClaude Fable 5Comparison
Active parameters40BUnknown (est. »100B)~3-10× fewer
Avg turns23.15.34.4× more
pass@1 (consistency)0.4560.55382.5% of Fable 5
pass@5 (ceiling)0.5920.67687.6% of Fable 5
Format quality0.9270.99593.2% of Fable 5

The pass@1 → pass@5 improvement reveals a key insight: GLM5.2’s iterative 23-turn strategy introduces more variance — sometimes the exploration leads to the correct answer, sometimes it leads astray. Fable 5’s 5-turn precision strategy is both more consistent (higher pass@1) and achieves a higher ceiling (higher pass@5), demonstrating superior per-step reasoning. Nevertheless, GLM5.2 closes the relative gap from 82.5% (pass@1) to 87.6% (pass@5) when we take the best of 5 independent attempts.

Relative to all top-3 closed models:

MetricFable 5Opus 4.7Gemini 3.5 FlashGLM5.2-744B-A40B
pass@10.5530.5050.4390.456
pass@50.6760.6200.5620.592
Turns5.37.416.123.1

GLM5.2 surpasses Gemini 3.5 Flash on both pass@1 (+1.7 pts) and pass@5 (+3.0 pts), despite Gemini being a frontier closed model. It is competitive with Opus 4.7 on pass@5 (0.592 vs 0.620, gap of 2.8 pts) while using 3× more turns.

The open-closed gap summary:

MetricBest Closed (Fable 5)GLM5.2-744B-A40BGap
pass@10.5530.4569.7 pts
pass@50.6760.5928.4 pts

The gap between the best open-source model and the best closed-source model is 9.7 points on pass@1 and 8.4 points on pass@5. While meaningful, this gap is narrower than one might expect given the accessibility and cost difference between frontier closed APIs and self-hosted open models with 40B active parameters.


6. Revisiting Training Recipes: What Benchmark Behavior Reveals

Beyond ranking models, the benchmark reveals behavioral patterns that carry direct implications for training. By cross-referencing tool-calling behavior from closed APIs with our own RL training experiments across three scales (35B, 122B, 397B), we identify actionable training recipe insights.

6.1 Turn-Count Behavior Across Training Approaches

A striking pattern emerges when comparing tool-calling behavior across different training methods and model families:

Training Method / ModelScaleSourceAvg TurnsAccuracyObservation
Qwen3.5 base (no RL)35BOurs (open)11.50.173Moderate exploration
Qwen3.5 base (no RL)397BOurs (open)14.20.356Deeper exploration at scale
Qwen3.5 + GRPO/DAPO RL35BOurs (RL-trained)~2-3~0.50Collapsed to mega-action
Qwen3.5 + GRPO/DAPO RL122B-397BOurs (RL-trained)~6-7~0.49-0.51Reduced but not collapsed
GLM5.2 (PPO-trained)744BOpen (third-party)23.10.456Extensive exploration
GPT-5.5 (effort=high)UnknownClosed API4.70.448Few broad queries
Claude Fable 5UnknownClosed API5.30.553Few precise queries

6.2 Finding: PPO Encourages Exploration; GRPO/DAPO Encourages Compression

In our RL training experiments across Qwen3.5-35B, 122B, and 397B:

This reveals a fundamental trade-off in RL algorithm choice for agentic tasks:

AlgorithmTurn BehaviorMechanismTrade-off
GRPO/DAPOFewer turns (compressed)Trajectory-level reward → learns shortest-path solutionsEfficient but may miss subtle multi-step problems
PPOMore turns (exploratory)Step-level value function → intermediate exploration gets creditThorough but computationally expensive

6.3 Finding: Scale Determines RL Stability and Catastrophic Forgetting

Our RL training experiments reveal a critical scale-dependent phenomenon:

At 35B scale (Qwen3.5-35B-A3B):

At 122B and 397B scale:

Implication: Small models face a capacity bottleneck where RL training forces a trade-off between task performance and general capability. Larger models (122B+) can accommodate both without severe degradation — a critical finding for practitioners choosing model scale for RL post-training.

6.4 Connecting Closed-Model Behavior to Training Insights

The closed-model observations corroborate our training findings:

Closed ModelTurnsLikely Training PatternEvidence
GPT-5.42.4Aggressive efficiency optimizationSingle mega-query, minimal exploration — mirrors 35B GRPO behavior
GPT-5.54.7Moderate efficiency + reasoningFew broad queries with reasoning tokens
Claude Fable 55.3Efficiency + deep per-step reasoningPrecise queries, strong internal thinking
Gemini 3.5 Flash16.1Exploration-encouraging (PPO-like)Iterative hypothesis testing
GLM5.223.1PPO (confirmed in technical report)Maximum exploration

GPT-5.4’s 2-turn mega-action behavior mirrors what we observe when RL-training 35B models with GRPO — suggesting aggressive efficiency optimization that limits multi-step reasoning. Conversely, GLM5.2’s extensive exploration and Gemini’s iterative approach suggest training methods that reward intermediate investigation rather than just final accuracy.

6.5 Recommendations for Training Recipe Design

Based on the combined evidence:

  1. Algorithm selection determines exploration behavior. For tasks requiring thorough multi-step investigation, PPO’s exploration-encouraging property produces agents that investigate more deeply. For latency-sensitive deployment, GRPO/DAPO produces more streamlined agents — but verify that general capabilities are preserved.

  2. Scale determines safe RL regime. At ≤35B active parameters, aggressive RL risks severe catastrophic forgetting; conservative learning rates or capability-preservation regularization are necessary. At 122B+, models tolerate RL training with minimal degradation (~2%).

  3. Turn count is a training diagnostic. Monitoring average turn count during RL training provides an early signal: sudden collapse to 2-3 turns indicates the model is learning shortcuts rather than genuine multi-step reasoning. This collapse correlates with catastrophic forgetting at small scale.

  4. Step limits must match model strategy. GLM5.2 gained 8.6 points by increasing from 25 to 50 steps. Evaluation frameworks should accommodate diverse investigation strategies rather than imposing uniform step budgets that penalize thorough exploration.


7. Conclusion

We presented a systematic evaluation of 12 model configurations on a self-created vertical-domain agentic multi-turn coding benchmark with verifiable results.

On benchmark design:

On model capability:

On training recipes:

Future directions: Evaluating fine-tuned open models (SFT + RL at 122B–397B scale), testing Gemini 3.5 Flash at maximum thinking effort for a fairer comparison, and extending the benchmark to additional vertical domains.


Appendix: Model Configurations

A.1 Qwen3.5 Configuration

All Qwen3.5 models (35B, 122B, 397B) are evaluated using the recommended thinking-mode settings:

ParameterValue
Inference enginevLLM
PrecisionBF16
temperature0.6
top_p0.95
top_k20
min_p0.0
presence_penalty0.0
repetition_penalty1.0
Max output tokens16,384
Max steps (turns)50
Samples per task5
Thinking modeEnabled (Qwen3.5 native thinking)

A.2 GLM5.2-744B-A40B Configuration

GLM5.2 is served via SGLang on a 4-node cluster with Expert Parallelism:

ParameterValue
Inference engineSGLang 0.5.9
PrecisionBF16
GPUs4× p5e nodes (32× H200 total)
ParallelismEP=32 (Expert Parallel)
Context window474K tokens (per worker)
CUDA graphsEnabled
Speculative decodingNot used (EAGLE attempted but unstable for MoE)
temperature0.3
top_p0.95
top_kDisabled (-1)
frequency_penalty0.0
presence_penalty0.0
Max output tokens16,000
Max steps (turns)50
Samples per task5
Thinking modeEnabled (GLM5.2 native reasoning_content)

Note on GLM5.2 thinking mode: GLM5.2 produces structured responses with separate reasoning_content (internal chain-of-thought) and content (final answer) fields. The thinking process is always active and cannot be disabled — similar to Claude Fable 5’s always-on adaptive thinking.

Note on step limit: GLM5.2 requires a 50-step limit (vs 25 for Qwen3.5) because its thorough investigation style averages 23 turns. At 25 steps, 28% of samples hit step-exhaustion, reducing pass@1 by 8.6 points (0.370 → 0.456).