Minimax M2
7h ago
Missed a formatting constraint and slowed down.
Open Source
Daily drift snapshot against a 21-day baseline with auto + human signals.
Last run Jan 13, 2026 (2h ago)
7-day drift
AUTO DUMB INDEX
54
Sus
vs baseline +5
Why it moved
Instruction drift | med | JSON compliance | Delta +6
Latency up | med | p95 higher | Delta +4
Accuracy steady | low | Flat baseline | Delta +1
Baseline window: 21 days
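The page does not say how the "vs baseline" deltas are computed. As a minimal sketch, assuming each delta is today's score minus the mean of the previous 21 daily scores, the calculation might look like the following. The function name `drift_vs_baseline` and the sample data are hypothetical and used only for illustration.

```python
from statistics import mean


def drift_vs_baseline(daily_scores, baseline_days=21):
    """Return the latest score minus the mean of the preceding baseline window.

    Assumption: the baseline is a plain average of the prior `baseline_days`
    daily values; the dashboard may use a different aggregate.
    """
    if len(daily_scores) < baseline_days + 1:
        raise ValueError("need a full baseline window plus today's value")
    *history, today = daily_scores
    baseline = mean(history[-baseline_days:])
    return today - baseline


# Hypothetical data: 21 baseline days at 48, then today's value of 55.
scores = [48.0] * 21 + [55.0]
print(f"{drift_vs_baseline(scores):+.0f} vs baseline")  # prints "+7 vs baseline"
```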
Accuracy
Objective tasks solved correctly.
46%
+4 vs baseline
Reasoning robustness
Consistency across prompt variations.
42%
+3 vs baseline
Instruction following
Format and constraint compliance.
55%
+7 vs baseline
Hallucination risk
Confident wrong answers on known items.
39%
+2 vs baseline
Refusal anomaly
Unexpected refusals on safe prompts.
33%
+1 vs baseline
Latency
p50/p95 response time drift.
50%
+5 vs baseline
Variance
Run-to-run stability.
44%
+3 vs baseline
Eval suite
Tier 0 | Sanity checks | 66 | -4 today | 12 tasks
Tier 1 | Factual QA | 63 | -3 today | 20 tasks
Tier 2 | Reasoning + math | 60 | -5 today | 18 tasks
Tier 3 | Coding | 58 | -4 today | 12 tasks
Tier 4 | Instruction stress | 55 | -6 today | 10 tasks
Community
Top categories today
Minimax M2
7h ago
Missed a formatting constraint and slowed down.