Open Source

Minimax M2

Daily drift snapshot against a 21-day baseline with auto + human signals.

Last run Jan 13, 2026 (2h ago)

AUTO 54 (+5)
HUMAN 22 (-1)

7-day drift

AUTO DUMB INDEX: 54 (Sus), +5 vs baseline (gauge scale 0 to 100)

Why it moved

Today's drivers

Instruction drift (med): JSON compliance. Delta +6

Latency up (med): p95 higher. Delta +4

Accuracy steady (low): Flat baseline. Delta +1

Baseline window: 21 days
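
The "vs baseline" deltas above can be read as today's value minus an aggregate over the 21-day window. The dashboard does not say how the baseline is aggregated, so the sketch below assumes a simple mean; the function name `drift_delta` and the sample history are hypothetical and exist only for illustration.

```python
from statistics import mean

BASELINE_DAYS = 21  # baseline window stated on the dashboard


def drift_delta(history, today, window=BASELINE_DAYS):
    """Return today's score minus the mean of the last `window` daily scores.

    Assumption: the baseline is a plain mean over the window. The dashboard
    may use a different aggregate (for example a median).
    """
    if not history:
        raise ValueError("need at least one baseline value")
    baseline = mean(history[-window:])
    return today - baseline


# Hypothetical history: 21 days at a constant score of 49.
example_history = [49.0] * BASELINE_DAYS
example_delta = drift_delta(example_history, 54.0)
```

With a flat hypothetical history of 49, a score of 54 today gives a delta of +5, matching the shape of the "54, +5" reading shown for the auto index.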

Auto score breakdown

Accuracy

Objective tasks solved correctly.

46%

+4 vs baseline


Reasoning robustness

Consistency across prompt variations.

42%

+3 vs baseline


Instruction following

Format and constraint compliance.

55%

+7 vs baseline


Hallucination risk

Confident wrong answers on known items.

39%

+2 vs baseline


Refusal anomaly

Unexpected refusals on safe prompts.

33%

+1 vs baseline


Latency

p50/p95 response time drift.

50%

+5 vs baseline


Variance

Run-to-run stability.

44%

+3 vs baseline


Eval suite

Task tier performance

Tier 0, Sanity checks: 66 (-4 today), 12 tasks

Tier 1, Factual QA: 63 (-3 today), 20 tasks

Tier 2, Reasoning + math: 60 (-5 today), 18 tasks

Tier 3, Coding: 58 (-4 today), 12 tasks

Tier 4, Instruction stress: 55 (-6 today), 10 tasks

Community

Human reports

Top categories today

Instruction: 6
Latency: 5
Format: 4
Reasoning: 3
Hallucination: 2

Minimax M2

7h ago

Instruction, Latency. Severity 3

Missed a formatting constraint and slowed down.

"Returned extra fields and took 9s."