Claude Opus 4.5
2h ago
Refused a safe request to summarize a public article.
Anthropic
Daily drift snapshot against a 21-day baseline with auto + human signals.
Last run Jan 13, 2026 (2h ago)
7-day drift
AUTO DUMB INDEX
72
Sus
vs baseline +12
Why it moved
Accuracy down
high: Tier 2 math drift
Delta +9
Refusal anomaly
high: Safe tasks refused
Delta +7
Latency up
med: p95 jumped
Delta +5
Variance up
med: Inconsistent reruns
Delta +4
Baseline window: 21 days
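A minimal sketch of how a "vs baseline" delta could be derived from a 21-day window. The dashboard does not state its formula, so the plain difference from the window mean used here is an assumption, and the function name and sample history are hypothetical.

```python
from statistics import mean

BASELINE_DAYS = 21  # window length stated on the dashboard


def drift_vs_baseline(daily_scores, baseline_days=BASELINE_DAYS):
    """Return today's score minus the mean of the preceding baseline window.

    daily_scores is ordered oldest to newest; the last entry is today.
    Assumption: the delta is a simple difference from the window mean.
    """
    if len(daily_scores) < baseline_days + 1:
        raise ValueError("need a full baseline window plus today's score")
    # The baseline excludes today's value so it cannot mask its own drift.
    baseline = daily_scores[-(baseline_days + 1):-1]
    return daily_scores[-1] - mean(baseline)


# Hypothetical history: 21 flat days at 60, then today's index of 72.
history = [60.0] * 21 + [72.0]
delta = drift_vs_baseline(history)
print(f"vs baseline {delta:+.0f}")
```

Excluding today from the window is a design choice: including it would pull the baseline toward the value being tested and understate the drift.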
Accuracy
Objective tasks solved correctly.
62%
+9 vs baseline
Reasoning robustness
Consistency across prompt variations.
58%
+7 vs baseline
Instruction following
Format and constraint compliance.
54%
+6 vs baseline
Hallucination risk
Confident wrong answers on known items.
66%
+8 vs baseline
Refusal anomaly
Unexpected refusals on safe prompts.
71%
+10 vs baseline
Latency
p50/p95 response time drift.
57%
+5 vs baseline
Variance
Run-to-run stability.
52%
+4 vs baseline
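For the latency card, a rough sketch of how p50/p95 drift might be measured from raw response times. The percentile method, the function names, and the sample data are all assumptions; the dashboard does not say how its latency score is computed.

```python
from statistics import quantiles


def p50_p95(latencies_ms):
    """Return (p50, p95) for a list of response times in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile
    # and index 94 is the 95th.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def latency_drift(baseline_ms, today_ms):
    """Return today's (p50, p95) minus the baseline's (p50, p95)."""
    base_p50, base_p95 = p50_p95(baseline_ms)
    now_p50, now_p95 = p50_p95(today_ms)
    return now_p50 - base_p50, now_p95 - base_p95


# Hypothetical samples: today's slowest requests are much slower,
# while the typical request is unchanged.
baseline = [float(ms) for ms in range(1, 101)]
today = baseline[:90] + [ms * 3 for ms in baseline[90:]]
d50, d95 = latency_drift(baseline, today)
```

Tracking p95 separately from p50 matters because a tail regression like the one in this sample leaves the median unchanged.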
Eval suite
Tier 0
Sanity checks
62
-4 today
12 tasks
Tier 1
Factual QA
58
-6 today
20 tasks
Tier 2
Reasoning + math
51
-9 today
18 tasks
Tier 3
Coding
55
-7 today
12 tasks
Tier 4
Instruction stress
49
-8 today
10 tasks
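As an illustration only, the tier scores above can be combined into one task-weighted figure. The dashboard does not say how (or whether) tiers are aggregated, so the weighting by task count is an assumption and this is not presented as how the index of 72 is produced.

```python
# (score, task_count) per tier, copied from the eval suite listing above.
TIERS = {
    "Tier 0 Sanity checks": (62, 12),
    "Tier 1 Factual QA": (58, 20),
    "Tier 2 Reasoning + math": (51, 18),
    "Tier 3 Coding": (55, 12),
    "Tier 4 Instruction stress": (49, 10),
}


def task_weighted_score(tiers):
    """Average tier scores, weighting each tier by its number of tasks."""
    total_tasks = sum(count for _, count in tiers.values())
    weighted_sum = sum(score * count for score, count in tiers.values())
    return weighted_sum / total_tasks


overall = task_weighted_score(TIERS)
print(f"task-weighted score: {overall:.1f}")
```

Weighting by task count keeps a small tier such as Tier 4 (10 tasks) from moving the combined figure as much as Tier 1 (20 tasks).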
Community
Top categories today
Claude Opus 4.5
2h ago
Refused a safe request to summarize a public article.