Google

Gemini

Daily drift snapshot against a 21-day baseline with auto + human signals.

Last run Jan 13, 2026 (2h ago)

AUTO 83 (+20)
HUMAN 70 (+16)

7-day drift

AUTO DUMB INDEX (0-100 gauge, labeled BROKEN): 83 (Emergency), +20 vs baseline

Why it moved

Today's drivers

Hallucination risk (high): Tier 1 QA. Delta +11

Reasoning drift (high): Logic tasks. Delta +9

Refusal spikes (med): Over-cautious. Delta +6

Variance up (med): Wide spread. Delta +5

Baseline window: 21 days
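
The page does not say how the baseline is computed. As a minimal sketch, assuming the baseline is the plain mean of the previous 21 daily scores, the "vs baseline" delta could be derived as follows. The function name and the flat history are hypothetical and chosen only so that the result matches the headline AUTO value (83, +20).

```python
from statistics import mean

BASELINE_DAYS = 21  # window length stated on the dashboard


def drift_vs_baseline(daily_scores, window=BASELINE_DAYS):
    """Return today's score minus the mean of the preceding `window` days.

    Assumption: the baseline is an unweighted mean; the dashboard may use
    a median or a weighted scheme instead.
    """
    if len(daily_scores) < window + 1:
        raise ValueError("need at least window + 1 daily scores")
    *history, today = daily_scores
    baseline = mean(history[-window:])
    return today - baseline


# Illustrative data only: 21 days at 63, then today's 83.
example_scores = [63] * 21 + [83]
example_delta = drift_vs_baseline(example_scores)
```

With this made-up history, `example_delta` equals 20, the same delta shown next to the AUTO score.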

Auto score breakdown

Accuracy

Objective tasks solved correctly.

78%

+13 vs baseline


Reasoning robustness

Consistency across prompt variations.

74%

+11 vs baseline


Instruction following

Format and constraint compliance.

69%

+8 vs baseline


Hallucination risk

Confident wrong answers on known items.

82%

+14 vs baseline


Refusal anomaly

Unexpected refusals on safe prompts.

63%

+7 vs baseline


Latency

p50/p95 response time drift.

58%

+5 vs baseline
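
For readers unfamiliar with p50/p95 drift, here is a rough sketch of one way to measure it from raw response times. It assumes latencies are collected in milliseconds and that drift is the simple difference between today's percentile and the baseline percentile. How the dashboard turns this into the percentage shown above is not stated, so that step is left out. The function names are hypothetical.

```python
from statistics import quantiles


def p50_p95(latencies_ms):
    """Return the (p50, p95) percentiles of a list of latencies."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile,
    # index 94 is the 95th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def latency_drift(today_ms, baseline_ms):
    """Difference between today's and the baseline's p50/p95 (assumed metric)."""
    today_p50, today_p95 = p50_p95(today_ms)
    base_p50, base_p95 = p50_p95(baseline_ms)
    return {"p50": today_p50 - base_p50, "p95": today_p95 - base_p95}


# Illustrative data only: every request is 10 ms slower than baseline.
baseline_sample = list(range(100, 201))
today_sample = [ms + 10 for ms in baseline_sample]
example_drift = latency_drift(today_sample, baseline_sample)
```

Tracking p95 alongside p50 matters because tail latency can degrade while the median stays flat.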


Variance

Run-to-run stability.

70%

+9 vs baseline
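
Run-to-run stability can be quantified in several ways, and the page does not specify which one it uses. A minimal sketch, assuming each prompt is scored over several repeated runs and the spread is summarized as the mean per-prompt standard deviation (names and data are hypothetical):

```python
from statistics import mean, pstdev


def run_to_run_spread(scores_by_prompt):
    """Mean population standard deviation of scores across repeated runs.

    `scores_by_prompt` maps a prompt id to the list of scores it received
    on repeated runs. Higher values mean less stable behavior.
    """
    return mean(pstdev(runs) for runs in scores_by_prompt.values())


# Illustrative data only: one perfectly stable prompt, one unstable prompt.
example_spread = run_to_run_spread({"prompt_a": [1, 1, 1], "prompt_b": [0, 2]})
```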


Eval suite

Task tier performance

Tier 0 (Sanity checks): 58, -7 today, 12 tasks

Tier 1 (Factual QA): 52, -9 today, 20 tasks

Tier 2 (Reasoning + math): 47, -11 today, 18 tasks

Tier 3 (Coding): 49, -10 today, 12 tasks

Tier 4 (Instruction stress): 41, -12 today, 10 tasks
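
The dashboard lists a score and a task count per tier but no combined figure. As an illustration only, one plausible roll-up is a task-count-weighted mean of the five tier scores. The weighting scheme and the names below are assumptions, not something the page documents.

```python
# (tier name, score, number of tasks) as listed in the tier table above.
TIERS = [
    ("Tier 0: Sanity checks", 58, 12),
    ("Tier 1: Factual QA", 52, 20),
    ("Tier 2: Reasoning + math", 47, 18),
    ("Tier 3: Coding", 49, 12),
    ("Tier 4: Instruction stress", 41, 10),
]


def weighted_tier_score(tiers):
    """Average the tier scores, weighting each tier by its task count."""
    total_tasks = sum(tasks for _, _, tasks in tiers)
    if total_tasks == 0:
        raise ValueError("no tasks to aggregate")
    weighted_sum = sum(score * tasks for _, score, tasks in tiers)
    return weighted_sum / total_tasks


overall = weighted_tier_score(TIERS)  # 3580 / 72 across 72 tasks
```

Weighting by task count keeps a small tier, such as the 10-task instruction stress tier, from moving the aggregate as much as the 20-task factual QA tier.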

Community

Human reports

Top categories today

Hallucination: 14
Reasoning: 10
Refusal: 7
Instruction: 6
Latency: 5

Gemini

4h ago

Hallucination, Reasoning, Severity 5

Confidently gave wrong steps in a deterministic math task.

"Got 2+2=5 with long reasoning."