G4 Work Eval — Pick the right AI model for the work, not the leaderboard

The metric

Balanced Value: one number for the daily call.

A blended score that rewards what production work actually needs: quality you can defend, cost you can scale, and consistency you can trust. The cost term is gated by consistency — a $0.03 run that crashes 6% of the time loses to a steady $0.09 one.


Balanced Value = 0.72·Q + 0.23·C + 0.05·E

where Q = Quality, C = Cost (gated by S, consistency), and E = Reliability.

Higher = better. The complete breakdown is in the methodology.
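As an illustration of how the three weights combine, here is a minimal sketch. The page does not spell out how the consistency gate works, so the code assumes the simplest form: the cost sub-score is multiplied by a consistency factor S in [0, 1]. The function name, the 0–100 sub-score scale, and all input values are hypothetical, not benchmark data.

```python
def balanced_value(quality, cost_score, consistency, reliability,
                   w_quality=0.72, w_cost=0.23, w_reliability=0.05):
    """Blend three 0-100 sub-scores into one 0-100 number.

    ASSUMPTION: the gate multiplies the cost sub-score by consistency S,
    a factor in [0, 1]. The real gate may take a different form.
    """
    if not 0.0 <= consistency <= 1.0:
        raise ValueError("consistency must be in [0, 1]")
    gated_cost = cost_score * consistency
    return (w_quality * quality
            + w_cost * gated_cost
            + w_reliability * reliability)


# Hypothetical inputs: same quality, but the cheaper model is less consistent.
steady = balanced_value(quality=85.0, cost_score=80.0,
                        consistency=0.98, reliability=99.0)
volatile = balanced_value(quality=85.0, cost_score=95.0,
                          consistency=0.70, reliability=94.0)
print(steady > volatile)  # the steadier, pricier model scores higher
```

Under this assumed gate, the cheaper model's cost advantage shrinks in proportion to its instability, which is the behavior the text describes.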

The frontier of quality × cost

Where fourteen models land on cost and quality.

Dashed line traces the efficient frontier: no model on it is beaten on both price and score.
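The frontier rule can be sketched as a standard Pareto filter. This is an illustration, not the benchmark's own code: the `Model` type, the function name, and the five rows are hypothetical, and ties are handled with the usual convention that a model is dominated only if another is at least as good on both axes and strictly better on one.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float     # USD per run, lower is better
    quality: float  # 0-100, higher is better


def efficient_frontier(models):
    """Return the models that no other model beats on both price and score."""
    def dominates(a, b):
        # a is no worse than b on both axes and strictly better on one
        return (a.cost <= b.cost and a.quality >= b.quality
                and (a.cost < b.cost or a.quality > b.quality))

    frontier = [m for m in models
                if not any(dominates(other, m) for other in models)]
    return sorted(frontier, key=lambda m: m.cost)


# Hypothetical rows, not benchmark data.
models = [
    Model("A", 0.01, 80.0),
    Model("B", 0.05, 78.0),   # beaten by A on both axes
    Model("C", 0.10, 90.0),
    Model("D", 1.00, 93.0),
    Model("E", 1.20, 91.0),   # beaten by D on both axes
]
print([m.name for m in efficient_frontier(models)])  # ['A', 'C', 'D']
```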

Four lenses at a glance

"Value" isn't one thing. The leader changes with the lens.

Each card shows the same fourteen models scored under a different cost-aware lens. The headline is the lens leader; the strip below is the non-dominated frontier (models not beaten on both quality and price). Click a card to read what the lens optimizes for and why it matters.

Final Ranking

One leader, four honest signals.

The Balanced Value Index blends quality, cost efficiency and reliability, gated by consistency so cheap-but-volatile models can't game the board. For most work outside of coding it is the right default ranking. It is not the only one: switch the tabs, or set your own weights below.

Your priorities

Quality: 72%

Cost: 23%

Reliability: 5%

Drag to re-rank the board with your own weights. The default 72 / 23 / 5 mirrors Balanced Value; cost stays gated by consistency.
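A minimal sketch of how custom weights could re-rank a board. It assumes each model already has three 0–100 sub-scores, that the cost sub-score has already been gated by consistency, and that slider values are normalized to sum to 1. The function name `rerank` and both rows are hypothetical.

```python
def rerank(rows, quality_w, cost_w, reliability_w):
    """Sort models by a weighted blend of (quality, gated_cost, reliability).

    ASSUMPTION: slider values are normalized so the weights sum to 1.
    """
    total = quality_w + cost_w + reliability_w
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    wq, wc, wr = quality_w / total, cost_w / total, reliability_w / total
    scored = [(name, wq * q + wc * c + wr * r)
              for name, (q, c, r) in rows.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical sub-scores: (quality, gated cost, reliability).
rows = {
    "cheap": (84.0, 95.0, 97.0),
    "premium": (93.0, 40.0, 99.0),
}
default_board = rerank(rows, 72, 23, 5)       # the default weights
quality_only = rerank(rows, 100, 0, 0)        # all weight on quality
print(default_board[0][0], quality_only[0][0])  # cheap premium
```

With these made-up rows, the default weights put the cheaper model first, while a quality-only setting flips the order.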

[Ranking table. Columns: #, Model, Balanced score (higher = better). Top score: 88.9]

The decision

88.9: top Balanced Value

Balanced #1: Kimi

Balanced #2: DeepSeek

Balanced #3: MiniMax

Kimi and DeepSeek lead the Balanced view.

Kimi K2.7 Code (88.9) and DeepSeek V4 Pro (88.6) sit at the top of Balanced Value on the same published scale. MiniMax M3 is third at 83.9, ahead of GLM 5.2 (76.2) and GPT-5.5/GPT-5.4 (75.0). GLM 5.2 has the strongest new median (91), while Claude Fable 5 remains the pure Quality Core leader. The right deployment is still a routing policy, not a single-model bet.

Low-cost default

Kimi K2.7 · DeepSeek V4

Kimi and DeepSeek lead Balanced Value with low cost and high Pass@55/Pass@80. Use them as the default low-cost lane behind validation.

High-stakes deliverables

Fable 5 · GPT-5.5 · Opus 4.7

Claude Fable 5 remains the pure quality ceiling: Quality Core 93.5, Pass@80 100%, and a high floor. Use this lane when review cost dominates token cost.

Tail-risk guardrail

GLM 5.2 · GPT-5.4 · MiniMax M3

GLM's median is excellent, but its wider lower tail keeps it behind the best value rows. Route it behind validation and fallback.

This routing layer is what G4 OS ships: every model above in one workspace, routed for quality, speed and cost. Download G4 OS. Benchmark code and data are on GitHub.

Switch the metric tab to re-rank · Balanced is the recommended default for everyday selection.

Methodology

One formula, fully transparent.

Quality is 72% of the final score. Cost is 23%, gated by consistency so a cheap run with a failure tail loses to a steadier model. Reliability adds 5%. All rows use the same published scale.

Drill-down

How the scores spread, model by model.

Two views: the range of scores each model produces across 10 runs per model-task, and the operating cost per run. The mean (open circle) and median (filled circle) markers expose each model's consistency: when the two sit far apart, a few outlier runs are pulling the mean.

Quality score per run (0–100)

ordered by reliability-adjusted score (μ − σ/2) · mean (○) · median (●) · min → max
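The summary statistics behind this chart can be sketched as follows. The page does not say whether σ is the population or the sample standard deviation, so the code assumes the population form; the function name and the run scores are hypothetical.

```python
from statistics import mean, median, pstdev


def run_summary(scores):
    """Summarize one model's run scores, including mu - sigma/2."""
    mu = mean(scores)
    sigma = pstdev(scores)  # ASSUMPTION: population standard deviation
    return {
        "mean": mu,
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
        "adjusted": mu - sigma / 2,
    }


# Hypothetical runs: nine strong results and one collapse.
volatile = run_summary([95] * 9 + [20])
steady = run_summary([88] * 10)
print(volatile["median"], volatile["mean"], volatile["adjusted"])
print(steady["adjusted"])
```

In this made-up case the volatile model has the higher median (95 vs 88) but the lower adjusted score, because the single collapsed run inflates σ.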

Cost per run (USD)

no-cache · same row order as quality — read across to see cost vs reliability

Operational Drag: wasted steps and time

Agentic work costs more than tokens. Extra steps and longer tool loops add latency, orchestration load, error surface, and user-perceived unpredictability. Drag is a real cost separable from quality and price.

Lowest: 1.7 (Fable 5)

Highest: 91.2 (Gemini Flash)

Acceptable vs Excellent: two lenses, two defaults

The two cost-aware lenses point at different defaults on purpose. Acceptable rewards cheap reliability for volume; Excellent rewards premium quality that isn't wildly overpriced. The crossover is the trade-off.

Why balanced wins

Single-metric rankings misrank.

Four facts that change which model you should reach for tomorrow — and why a blended score beats any one number.

Quality is crowded

85–93.5

The quality ceiling is unchanged: Fable 5 at 93.5. GLM 5.2 joins the high-median band, while Kimi K2.7 and DeepSeek win mainly through cost-adjusted value.

Cost is not

100×

$0.011 (DeepSeek) to $1.11 (Fable 5) per run. The 100× cost spread dwarfs the quality spread. At scale, cost is the variable that moves.

Downside tails

8.5%

GLM 5.2 has a 5.4% catastrophic tail. Nova, Sonnet and Kimi K2.6 also show material tails. Cheap specialists need routing & fallback — not blanket adoption.

Different #1 per lens

3 winners

Kimi K2.7 and DeepSeek lead the Balanced lens. MiniMax M3 is now third on Balanced Value at 83.9. Fable 5 still wins Quality Core and Excellent Value.

Pricing stress test

What if cheap tokens get more expensive?

Today's price gap between US frontier models and China-linked low-cost models is the widest the industry has seen. The benchmark would be naive to ignore that. We re-rank the field under two price shocks, holding quality, latency, errors, and tokens fixed.

Scenario A · Targeted shock

The field tightens

Only MiniMax and Kimi reprice upward (~2×). Kimi keeps the lead, while MiniMax M3 sits ahead of the GPT pair on Balanced Value. The margin narrows, but low absolute cost still matters.

Kimi K2.7 · 82.7

MiniMax M3 · 79.3

GPT-5.4 ↑ 75.3

GPT-5.5 ↑ 75.2

Scenario B · Whole-market reprice

Low-cost leaders hold

All models reprice upward; the low-cost pair rises more. Kimi and MiniMax remain ahead of the GPT pair because their starting cost is low enough to absorb a broad price shock.

Kimi K2.7 · 82.7

MiniMax M3 · 79.3

GPT-5.4 ≈ 75.3

GPT-5.5 · 75.2

The right operating model is therefore not a fixed bet on the cheapest provider — it's a pricing-aware routing layer that recomputes value when unit economics move. This page is a snapshot; production should run a router.

Per-task performance

Model by model, task by task.

Mean score per task (0–100) across 10 runs per model × task. Darker cell = higher score. The row order and the avg column re-rank in real time when you switch modes. All numbers come from the official 1,820-run set (14 models × 13 tasks × 10 runs), including the recovered Opus 4.7 / 4.8 runs. The T4 Slack CX column remains the clearest separator in the suite.

Mean score on the task · pure quality signal

Legend: low → high. Color scales within each task column; a border marks the column leader.

Task winners

A different #1 on almost every task.

"Best overall" uses task-local Quality Core. "Best cost-adjusted" uses task-local Acceptable Value. The fact that these two columns disagree on most tasks is the whole argument for routing.

The 13 tasks

Inside the thirteen tasks.

Click any task to expand the full prompt, definition of done, sources and Pass@1.

What would this task cost you?

Pick the task, up to three models, and either how often you run it or a plain number of runs. Totals use the median cost per run observed in the benchmark; tags compare your picks on the benchmark scores.
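The total described above reduces to median cost per run times number of runs. A minimal sketch, assuming a frequency is converted to a run count as runs per day times days; the function name and the cost samples are hypothetical, not benchmark figures.

```python
from statistics import median


def projected_cost(costs_per_run, runs):
    """Estimate total spend as median observed cost per run times runs."""
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return median(costs_per_run) * runs


# Hypothetical per-run costs in USD; one expensive outlier run.
observed = [0.010, 0.011, 0.012, 0.011, 0.050]

# "How often you run it": 50 runs a day over 20 working days.
runs = 50 * 20
total = projected_cost(observed, runs)
print(f"{runs} runs, about ${total:.2f}")
```

Using the median rather than the mean keeps a single expensive outlier run, like the $0.050 sample here, from inflating the estimate.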

Pick the right model for the work,
not the leaderboard.