Updated June 2026

Pick the right model for the work,
not the leaderboard.

A benchmark for AI models on real non-coding knowledge work, graded by rubric, scored by a mix of human reviewers and AI judges, priced by no-cache tokens.

For most non-coding work, the right default is Balanced Value · one lens of four.

11 AI models tested
13 Knowledge-work tasks
10× Runs per task
1,430 Total scored runs
The metric

Balanced Value: one number for the daily call.

A blended score that rewards what production work actually needs: quality you can defend, cost you can scale, and consistency you can trust. The cost term is gated by consistency — a $0.03 run that crashes 6% of the time loses to a steady $0.09 one.

Balanced Value = 0.72·Q + 0.23·C + 0.05·E
Q = Quality · C = Cost, gated by S · E = Reliability
Higher = better. The full breakdown is in the methodology.
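As a rough illustration of how the three terms combine, here is a minimal sketch. The page does not spell out how the gate works, so multiplying the cost score by a consistency factor S in [0, 1] is an assumption, and the function and parameter names are hypothetical.

```python
def balanced_value(quality, cost_score, consistency, reliability,
                   weights=(0.72, 0.23, 0.05)):
    """Blend three 0-100 component scores into one Balanced Value number.

    ASSUMPTION: the gate multiplies the cost score by a consistency
    factor in [0, 1]. The benchmark's actual gate may differ.
    """
    if not 0.0 <= consistency <= 1.0:
        raise ValueError("consistency must be in [0, 1]")
    w_quality, w_cost, w_reliability = weights
    gated_cost = cost_score * consistency  # volatile models keep less of their cost credit
    return w_quality * quality + w_cost * gated_cost + w_reliability * reliability
```

Under this assumed gate, a cheaper but volatile model (cost score 95, S = 0.5) ends up below a steadier, pricier one (cost score 85, S = 1.0) when quality and reliability are equal.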
The frontier of quality × cost

Where eleven models land on cost and quality.

Dashed line traces the efficient frontier: no model on it is beaten on both price and score.
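The frontier described above can be sketched as a Pareto filter over (cost, quality) pairs. This is one common way to compute it, not necessarily the benchmark's own code, and the four data points are hypothetical placeholders rather than benchmark results.

```python
def efficient_frontier(models):
    """Return the names of models not beaten on both price and score.

    `models` maps name -> (cost_per_run, quality). A model is dominated when
    another model is at least as cheap and at least as good, and strictly
    better on one of the two.
    """
    frontier = []
    for name, (cost, quality) in models.items():
        dominated = any(
            o_cost <= cost and o_quality >= quality
            and (o_cost < cost or o_quality > quality)
            for other, (o_cost, o_quality) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Cheapest first, which is the order a frontier line would be drawn in.
    return sorted(frontier, key=lambda n: models[n][0])


# Hypothetical points for illustration only, not benchmark results.
example = {
    "model_a": (0.02, 84.0),
    "model_b": (0.10, 88.0),
    "model_c": (0.15, 86.0),  # pricier and weaker than model_b
    "model_d": (1.10, 93.0),
}
print(efficient_frontier(example))
```

Here `model_c` drops out because `model_b` is both cheaper and higher scoring; the other three stay on the frontier.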
Four lenses at a glance

"Value" isn't one thing. The leader changes with the lens.

Each card shows the same eleven models scored under a different cost-aware lens. The headline is the lens leader; the strip below is the non-dominated frontier (models not beaten on both quality and price). Click a card to read what the lens optimizes for and why it matters.

Final Ranking

One leader, four honest signals.

The Balanced Value Index blends quality, cost efficiency and reliability, gated by consistency so cheap-but-volatile models can't game the board. For most work outside of coding it is the right default ranking. It is not the only one: switch the tabs, or set your own weights below.

Your priorities
Quality 72%
Cost 23%
Reliability 5%
Drag to re-rank the board with your own weights. The default 72 / 23 / 5 mirrors Balanced Value; cost stays gated by consistency.
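A minimal sketch of what re-ranking with custom weights could look like. It assumes each model already has three 0-100 component scores, with the cost score already gated by consistency; the function name, the input layout, and the two example rows are hypothetical.

```python
def rerank(components, quality_w=72, cost_w=23, reliability_w=5):
    """Re-rank models under user-chosen weights.

    `components` maps name -> (quality, gated_cost_score, reliability),
    all on a 0-100 scale. Weights are normalized, so they need not sum to 100.
    """
    total = quality_w + cost_w + reliability_w
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    scores = {
        name: (quality_w * q + cost_w * c + reliability_w * e) / total
        for name, (q, c, e) in components.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical component scores, not benchmark results.
board = {
    "cheap_model": (84, 95, 90),
    "premium_model": (93, 20, 100),
}
default_order = rerank(board)                # default 72 / 23 / 5
quality_only = rerank(board, 100, 0, 0)      # ignore cost and reliability
```

With these placeholder numbers the cheap model leads under the default weights, and the premium model leads once cost is weighted at zero.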
# Model Balanced (higher = better) 76.8
The decision
$0.0218 · cost per run, the lowest in the suite
B 76.6 · Balanced #1
97.7% · Pass@55
25× · cheaper than Opus 4.8

MiniMax M3 wins on production value, not raw quality.

M3 leads Balanced Value (76.6) and Acceptable Value (88.1) by combining the lowest cost in the suite with acceptable quality. It is not the quality leader — Claude Fable 5 takes Quality Core (93.5) with a perfect Pass@80, at 51× the price. Adding Opus 4.7 and 4.8 widens the premium tier but doesn't change the production-value answer. The right deployment is therefore a routing policy, not a single-model bet.

High-volume, low-risk
MiniMax M3
Route to M3 with validation + automatic fallback. Acceptable Value 88.1 — best production economics by a wide margin.
High-stakes deliverables
Fable 5 · GPT-5.5 · Opus 4.7
Route to Claude Fable 5, GPT-5.5 or Opus 4.7. Fable pairs the best Quality Core in the suite (93.5) with Pass@80 100% and a score floor of 82: premium price, no tail risk observed in this suite.
Reliability anchor
GPT-5.4 · Opus 4.7
GPT-5.4 or Opus 4.7: both at Pass@55 100% with zero catastrophic runs. Use when a weak answer is costly but top-end polish is optional.
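The three routes above can be sketched as an ordered fallback chain per tier. This is an illustration, not the G4 OS implementation: `call_model` and `validate` are hypothetical callables you would supply, the tier keys are made up, and using GPT-5.4 as the fallback for M3 is an assumption (the page only says "automatic fallback").

```python
# Ordered candidates per tier. The tier names are hypothetical labels.
ROUTES = {
    "high_volume": ["MiniMax M3", "GPT-5.4"],  # fallback target is an assumption
    "high_stakes": ["Claude Fable 5", "GPT-5.5", "Opus 4.7"],
    "reliability": ["GPT-5.4", "Opus 4.7"],
}


def route(tier, prompt, call_model, validate):
    """Try each model for the tier in order and return the first validated answer.

    `call_model(model, prompt)` returns an answer; `validate(answer)` returns
    True when the answer is acceptable. Both are supplied by the caller.
    """
    last_error = None
    for model in ROUTES[tier]:
        try:
            answer = call_model(model, prompt)
        except Exception as exc:  # provider error: move on to the next model
            last_error = exc
            continue
        if validate(answer):
            return model, answer
    raise RuntimeError(f"no model produced a valid answer for tier {tier!r}") from last_error
```

Keeping the chain as data makes it easy to change the routing when prices or scores move, without touching the control flow.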

This routing layer is what G4 OS ships: every model above in one workspace, routed for quality, speed and cost. Download G4 OS →  ·  Benchmark code and data on GitHub.

Switch the metric tab to re-rank · Balanced is the recommended default for everyday selection.

Methodology

One formula, fully transparent.

Quality is 72% of the final score. Cost is 23%, gated by consistency so a $0.03 run that crashes 6% of the time loses to a steady $0.09 run. Reliability adds 5%.

Drill-down

How the scores spread, model by model.

Two views: the range of scores each model produces across its runs (10 per task), and the operating cost per run. The mean (open circle) and median (filled circle) markers expose each model's consistency: the further apart they sit, the more skewed the runs.

Quality score per run (0–100)

ordered by reliability-adjusted score (μ − σ/2) · mean (○) · median (●) · min → max
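The ordering key μ − σ/2 is straightforward to reproduce. In this minimal sketch σ is taken as the population standard deviation, which is an assumption (the page does not say which estimator it uses), and the two score lists are hypothetical.

```python
from statistics import mean, pstdev


def reliability_adjusted(scores):
    """Mean minus half the standard deviation of a model's run scores.

    ASSUMPTION: sigma is the population standard deviation (pstdev).
    """
    return mean(scores) - pstdev(scores) / 2


def order_models(runs):
    """Sort model names from highest to lowest reliability-adjusted score."""
    return sorted(runs, key=lambda name: reliability_adjusted(runs[name]), reverse=True)


# Hypothetical run scores: same mean, very different spread.
runs = {
    "volatile": [100, 60, 100, 60],
    "steady": [80, 80, 80, 80],
}
print(order_models(runs))
```

Both placeholder models average 80, but the volatile one is penalized by half of its 20-point spread and sorts second.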

Cost per run (USD)

no-cache · same row order as quality — read across to see cost vs reliability
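To see what a per-run price means at volume, a minimal sketch multiplies cost per run by monthly runs. The 100,000 runs-per-month figure is an assumption for illustration; the two per-run prices are the ones quoted on this page for MiniMax M3 and Claude Fable 5.

```python
def monthly_cost(cost_per_run: float, runs_per_month: int) -> float:
    """Projected monthly spend in USD, assuming no-cache pricing and no retries."""
    return cost_per_run * runs_per_month


RUNS_PER_MONTH = 100_000  # hypothetical volume, not from the benchmark

for name, price in {"MiniMax M3": 0.0218, "Claude Fable 5": 1.11}.items():
    print(f"{name}: ${monthly_cost(price, RUNS_PER_MONTH):,.0f} per month")
```

At that assumed volume the gap is roughly $2,180 versus $111,000 a month, which is why the cost term carries real weight in the ranking.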

Operational Drag: wasted steps and time

Agentic work costs more than tokens. Extra steps and longer tool loops add latency, orchestration load, error surface, and user-perceived unpredictability. Drag is a real cost separable from quality and price.

0.0 · Lowest · Fable 5
91.2 · Highest · Gemini 3.5 Flash

Acceptable vs Excellent: two lenses, two defaults

The two cost-aware lenses point at different defaults on purpose. Acceptable rewards cheap reliability for volume; Excellent rewards premium quality that isn't wildly overpriced. The crossover is the trade-off.

Why balanced wins

Single-metric rankings misrank.

Four facts that change which model you should reach for tomorrow — and why a blended score beats any one number.

Quality is crowded
83.9–93.5
Nine of eleven models cluster within five Quality Core points. Fable 5 breaks the ceiling at 93.5 — and cost still does the separating.
Cost is not
51×
$0.022 (M3) to $1.11 (Fable 5) per run. The 51× cost spread dwarfs the 10 pt quality spread. At scale, cost is the variable that moves.
Downside tails
8.5%
Nova 2 Lite catastrophic rate. Sonnet 7.7%, Kimi 6.2%. GPT-5.4 and Fable 5 sit at 0%. Cheap specialists need routing & fallback — not blanket adoption.
Different #1 per lens
3 winners
M3 wins Balanced & Acceptable Value. Fable 5 wins Quality Core, Excellent Value, Pass@80 & lowest drag. GPT-5.4 holds the Pass@55 floor at a tenth of Fable’s price.
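The tail metrics quoted above can be sketched from raw run scores. Two assumptions are baked in: Pass@N is read as the share of runs scoring at least N (consistent with "Pass@80 100% and a score floor of 82"), and the catastrophic cutoff of 30 is a hypothetical placeholder, since the page does not state the threshold.

```python
def pass_at(scores, threshold):
    """Percentage of runs scoring at or above `threshold`.

    ASSUMPTION: this is how Pass@55 and Pass@80 are defined.
    """
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)


def catastrophic_rate(scores, floor=30):
    """Percentage of runs scoring below `floor`.

    ASSUMPTION: `floor=30` is a placeholder cutoff, not the benchmark's value.
    """
    return 100.0 * sum(s < floor for s in scores) / len(scores)


# Hypothetical run scores for two models.
premium_runs = [82, 90, 95, 88]
cheap_runs = [20, 60, 85, 90]
```

With these placeholder runs, the premium model clears both thresholds on every run, while the cheap one passes Pass@55 on three runs of four and has one catastrophic run: the pattern that calls for validation and fallback rather than blanket adoption.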
Pricing stress test

What if cheap tokens get more expensive?

Today's price gap between US frontier models and China-linked low-cost models is the widest the industry has seen. The benchmark would be naive to ignore that. We re-rank the field under two price shocks, holding quality, latency, errors, and tokens fixed.

Scenario A · Targeted shock
MiniMax drops to #4

Only MiniMax and Kimi reprice upward (~2×). The leader changes: GPT-5.4 takes #1 (Balanced 75.3), narrowly ahead of GPT-5.5 (75.2). M3 still wins Acceptable Value — but its overall production lead evaporates.

GPT-5.4 · 75.3
GPT-5.5 · 75.2
Gemini 3.5 Flash · 73.3
MiniMax M3 · 72.9
Scenario B · Whole-market reprice
MiniMax recovers to ≈#1

All models reprice upward; the China-linked pair rises more. MiniMax M3 ties GPT-5.4 at ~75.3. Being absolutely cheap still confers most of its advantage when every model is more expensive — but the lead is no longer unconditional.

MiniMax M3 · 75.3
GPT-5.4 · 75.3
GPT-5.5 · 75.2
Gemini 3.5 Flash · 73.3

The right operating model is therefore not a fixed bet on the cheapest provider — it's a pricing-aware routing layer that recomputes value when unit economics move. This page is a snapshot; production should run a router.
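A pricing-aware re-rank can be sketched as follows. Everything about the cost mapping is an assumption: the log-scale cost score, the fixed price anchors, the ungated cost term, and the two placeholder models are all hypothetical, chosen only to show how a ~2× reprice on one model can flip the order while quality stays fixed.

```python
import math


def cost_score(price, lo=0.01, hi=2.0):
    """Map a per-run price to 0-100 on a log scale (cheaper = higher).

    ASSUMPTION: the anchors `lo` and `hi` and the log mapping are illustrative.
    """
    price = min(max(price, lo), hi)
    return 100.0 * (math.log(hi) - math.log(price)) / (math.log(hi) - math.log(lo))


def rerank_under_shock(models, multipliers):
    """Recompute a blended value after multiplying selected prices.

    `models` maps name -> (quality, price_per_run, reliability). Quality and
    reliability are held fixed; only prices move. Consistency gating is
    omitted here for brevity.
    """
    ranked = []
    for name, (quality, price, reliability) in models.items():
        shocked_price = price * multipliers.get(name, 1.0)
        value = 0.72 * quality + 0.23 * cost_score(shocked_price) + 0.05 * reliability
        ranked.append((name, round(value, 1)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical models, not benchmark results.
field = {
    "cheap_model": (83, 0.02, 95),
    "mid_model": (89, 0.10, 100),
}
baseline = rerank_under_shock(field, {})
targeted = rerank_under_shock(field, {"cheap_model": 2.0})
```

In this toy setup the cheap model leads at baseline and loses the lead after its price doubles, which is the kind of recomputation a router would run whenever unit economics change.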

Per-task performance

Model by model, task by task.

Real mean score per task (0–100) across 10 runs per model×task. Darker cell = higher score. Row order and the avg column re-rank in real time when you switch modes. All numbers come from the official 1,430-run set, except the Opus 4.7 / 4.8 per-task scores, which are estimated until the audit pivot is rebuilt. The T4 Slack CX column remains the clearest separator in the suite.

Mean score on the task · pure quality signal
Legend: Low → High · color scales within each task column · border marks the column leader · ~ marks estimated rows (Opus 4.7/4.8)
Task winners

A different #1 on almost every task.

"Best overall" uses task-local Quality Core. "Best cost-adjusted" uses task-local Acceptable Value. The fact that these two columns disagree on most tasks is the whole argument for routing.

The 13 tasks

Inside the thirteen tasks.

Click any task to expand the full prompt, definition of done, sources and Pass@1.

More

Take it further.

The benchmark is open source, and the routing layer it argues for is a download away.