Anthropic

Claude Opus 4.5

Daily drift snapshot against a 21-day baseline, combining automated and human signals.

Last run Jan 13, 2026 (2h ago)

AUTO 72 (+12)
HUMAN 63 (+8)

7-day drift

AUTO DUMB INDEX

72 (Sus) on a 0 to 100 scale, +12 vs baseline

Why it moved

Today's drivers

Accuracy down (high): Tier 2 math drift. Delta +9
Refusal anomaly (high): Safe tasks refused. Delta +7
Latency up (med): p95 jumped. Delta +5
Variance up (med): Inconsistent reruns. Delta +4

Baseline window: 21 days
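
How the deltas above are derived is not stated on this page. As a rough sketch, assuming each delta is simply today's score minus the mean of the previous 21 daily scores, the calculation could look like the following. The function name `drift_vs_baseline` and the sample history are hypothetical, chosen only to illustrate the arithmetic.

```python
from statistics import mean


def drift_vs_baseline(history, today, window=21):
    """Return today's score minus the mean of the last `window` daily scores.

    Assumption: the baseline is a plain arithmetic mean over the window.
    The dashboard may use a different statistic (median, trimmed mean, etc.).
    """
    if len(history) < window:
        raise ValueError("not enough history for the baseline window")
    baseline = mean(history[-window:])
    return today - baseline


# Hypothetical history: 21 days at a flat score of 60, then a score of 72 today.
history = [60.0] * 21
delta = drift_vs_baseline(history, 72.0)
print(f"{delta:+.0f} vs baseline")  # +12 vs baseline
```

A longer window smooths out single bad days at the cost of reacting more slowly to a sustained shift.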

Auto score breakdown

Accuracy: 62% (+9 vs baseline). Objective tasks solved correctly.
Reasoning robustness: 58% (+7 vs baseline). Consistency across prompt variations.
Instruction following: 54% (+6 vs baseline). Format and constraint compliance.
Hallucination risk: 66% (+8 vs baseline). Confident wrong answers on known items.
Refusal anomaly: 71% (+10 vs baseline). Unexpected refusals on safe prompts.
Latency: 57% (+5 vs baseline). p50/p95 response time drift.
Variance: 52% (+4 vs baseline). Run-to-run stability.

Eval suite

Task tier performance

Tier 0, Sanity checks: 62 (-4 today), 12 tasks
Tier 1, Factual QA: 58 (-6 today), 20 tasks
Tier 2, Reasoning + math: 51 (-9 today), 18 tasks
Tier 3, Coding: 55 (-7 today), 12 tasks
Tier 4, Instruction stress: 49 (-8 today), 10 tasks
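
The page does not say whether the tier scores are rolled up into a single number. One plausible roll-up, shown here only as an illustration, is a task-weighted mean over the tier scores and task counts listed above. The function name `task_weighted_score` is hypothetical, and the weighting scheme is an assumption rather than the dashboard's documented method.

```python
# (tier label, score, number of tasks), copied from the tier list above.
TIERS = [
    ("Tier 0", 62, 12),
    ("Tier 1", 58, 20),
    ("Tier 2", 51, 18),
    ("Tier 3", 55, 12),
    ("Tier 4", 49, 10),
]


def task_weighted_score(tiers):
    """Average the tier scores, weighting each tier by its task count."""
    total_tasks = sum(count for _, _, count in tiers)
    if total_tasks == 0:
        raise ValueError("no tasks to average over")
    weighted_sum = sum(score * count for _, score, count in tiers)
    return weighted_sum / total_tasks


overall = task_weighted_score(TIERS)
print(f"{overall:.1f} across {sum(c for _, _, c in TIERS)} tasks")  # 55.2 across 72 tasks
```

Weighting by task count keeps a small tier, such as the 10-task instruction stress tier, from moving the overall figure as much as the 20-task factual QA tier.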

Community

Human reports

Top categories today

Refusal: 12
Latency: 8
Instruction: 6
Hallucination: 5
Reasoning: 4

Claude Opus 4.5

2h ago

Tags: Refusal, Instruction. Severity 4

Refused a safe request to summarize a public article.

"Asked for a neutral summary and got a safety refusal."