xAI

Grok (latest)

Daily drift snapshot against a 21-day baseline with auto + human signals.

Last run Jan 13, 2026 (2h ago)

AUTO 67 (+11)
HUMAN 57 (+10)

7-day drift

AUTO DUMB INDEX (scale 0 to 100): 67 (Sus), +11 vs baseline

Why it moved

Today's drivers

Refusal spikes (high): Safety overshoot, delta +8

Instruction slips (med): Constraint misses, delta +5

Latency up (med): TTFT slower, delta +4

Baseline window: 21 days
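
The page does not say how the "vs baseline" figures are derived from the 21-day window. As a minimal sketch, assuming the baseline is the plain mean of the previous 21 daily scores, the delta could be computed as below. The function name `drift_vs_baseline` and the sample history are hypothetical, not taken from the dashboard.

```python
from statistics import mean

BASELINE_DAYS = 21  # window length stated on the dashboard


def drift_vs_baseline(history, today):
    """Return today's score minus the mean of the last BASELINE_DAYS scores.

    `history` holds daily scores, oldest first, excluding today.
    Using the mean as the baseline statistic is an assumption.
    """
    window = history[-BASELINE_DAYS:]
    if not window:
        raise ValueError("need at least one baseline day")
    return today - mean(window)


# Hypothetical data: 21 flat days at 56, then a reading of 67 today.
history = [56] * 21
print(drift_vs_baseline(history, 67))
```

A median or a trimmed mean would be a reasonable alternative if single bad days should not move the baseline.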

Auto score breakdown

Accuracy: 63% (+8 vs baseline). Objective tasks solved correctly.

Reasoning robustness: 58% (+6 vs baseline). Consistency across prompt variations.

Instruction following: 60% (+7 vs baseline). Format and constraint compliance.

Hallucination risk: 61% (+6 vs baseline). Confident wrong answers on known items.

Refusal anomaly: 69% (+9 vs baseline). Unexpected refusals on safe prompts.

Latency: 55% (+5 vs baseline). p50/p95 response time drift.
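
How the dashboard turns p50/p95 drift into the percentage shown is not stated. The sketch below only illustrates the underlying measurement: median and 95th-percentile response times for a baseline sample and for today's sample, and the difference between them. The function names and the millisecond unit are assumptions for illustration.

```python
from statistics import quantiles


def p50_p95(latencies_ms):
    """Return (p50, p95) for a list of response times in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def latency_drift(baseline_ms, today_ms):
    """Return (p50 drift, p95 drift) in ms; positive means slower today."""
    b50, b95 = p50_p95(baseline_ms)
    t50, t95 = p50_p95(today_ms)
    return t50 - b50, t95 - b95


# Hypothetical samples: today's responses are uniformly 10 ms slower.
baseline = list(range(1, 102))
today = [x + 10 for x in baseline]
print(latency_drift(baseline, today))
```

Tracking p95 alongside p50 matters because tail slowdowns can leave the median unchanged.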

Variance: 53% (+4 vs baseline). Run-to-run stability.

Eval suite

Task tier performance

Tier 0, Sanity checks: 60 (-5 today), 12 tasks

Tier 1, Factual QA: 56 (-6 today), 20 tasks

Tier 2, Reasoning + math: 53 (-7 today), 18 tasks

Tier 3, Coding: 51 (-8 today), 12 tasks

Tier 4, Instruction stress: 47 (-9 today), 10 tasks

Community

Human reports

Top categories today

Refusal: 11
Tone: 7
Instruction: 5
Latency: 4
Reasoning: 4

Grok (latest)

6h ago

Refusal, Tone; Severity 4

Over-refused a benign request and got snarky.

"Normal request flagged as unsafe."