Gemini
4h ago
Confidently gave wrong steps in a deterministic math task.
Daily drift snapshot against a 21-day baseline, combining automated and human signals.
Last run Jan 13, 2026 (2h ago)
7-day drift
AUTO DUMB INDEX
83
Emergency
vs baseline +20
Why it moved
Hallucination risk: high (Tier 1 QA), delta +11
Reasoning drift: high (logic tasks), delta +9
Refusal spikes: med (over-cautious), delta +6
Variance up: med (wide spread), delta +5
Baseline window: 21 days
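The "vs baseline" figures compare a current reading with the 21-day baseline window. The page does not say how the baseline is aggregated, so the following is only a minimal sketch that assumes a plain mean over the preceding 21 daily values. The function name `drift_vs_baseline` and the sample series are hypothetical. The series is chosen to reproduce the headline reading of 83 at +20.

```python
from statistics import mean

BASELINE_DAYS = 21  # baseline window stated on the dashboard


def drift_vs_baseline(daily_scores):
    """Return the latest score minus the mean of the preceding baseline window.

    Assumption: the baseline is an unweighted mean of the 21 days before
    the latest reading. The dashboard does not confirm this.
    """
    if len(daily_scores) < BASELINE_DAYS + 1:
        raise ValueError("need at least 21 baseline days plus the latest reading")
    *history, today = daily_scores
    baseline = mean(history[-BASELINE_DAYS:])
    return today - baseline


# Illustrative data only: 21 flat days at 63, then a reading of 83.
scores = [63] * BASELINE_DAYS + [83]
print(drift_vs_baseline(scores))
```

A median or an exponentially weighted mean would be equally plausible baselines. The plain mean is used here only because it is the simplest to check by hand.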
Accuracy
Objective tasks solved correctly.
78%
+13 vs baseline
Reasoning robustness
Consistency across prompt variations.
74%
+11 vs baseline
Instruction following
Format and constraint compliance.
69%
+8 vs baseline
Hallucination risk
Confident wrong answers on known items.
82%
+14 vs baseline
Refusal anomaly
Unexpected refusals on safe prompts.
63%
+7 vs baseline
Latency
p50/p95 response time drift.
58%
+5 vs baseline
Variance
Run-to-run stability.
70%
+9 vs baseline
Eval suite
Tier 0: Sanity checks, 58 (-7 today), 12 tasks
Tier 1: Factual QA, 52 (-9 today), 20 tasks
Tier 2: Reasoning + math, 47 (-11 today), 18 tasks
Tier 3: Coding, 49 (-10 today), 12 tasks
Tier 4: Instruction stress, 41 (-12 today), 10 tasks
Community
Top categories today