DeepSeek

DeepSeek V3

Daily drift snapshot against a 21-day baseline with auto + human signals.

Last run Jan 13, 2026 (2h ago)

AUTO 59 (+9)
HUMAN 41 (+6)

7-day drift

AUTO DUMB INDEX: 59 (Sus) on a 0 to 100 scale, +9 vs baseline

Why it moved

Today's drivers

Reasoning drift (med): Tier 3 coding, delta +7
Hallucination risk (med): Known answers, delta +5
Variance up (med): More spread, delta +4

Baseline window: 21 days
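
The "vs baseline" figures on this page compare today's value with a 21-day baseline. The page does not say how the baseline is aggregated, so the sketch below assumes it is the plain mean of the 21 daily scores before today. The function name `drift_vs_baseline` and the sample history are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean

BASELINE_DAYS = 21  # from the page: "Baseline window: 21 days"


def drift_vs_baseline(daily_scores, window=BASELINE_DAYS):
    """Return (today, baseline, delta) for daily scores ordered oldest first.

    Assumption: the baseline is the mean of the `window` days before today.
    """
    if len(daily_scores) < window + 1:
        raise ValueError("need at least window + 1 daily scores")
    today = daily_scores[-1]
    baseline = mean(daily_scores[-(window + 1):-1])
    return today, baseline, today - baseline


# Illustrative history only: 21 flat baseline days followed by today's score.
history = [50] * 21 + [59]
today, baseline, delta = drift_vs_baseline(history)
print(f"today={today} baseline={baseline} delta={delta:+}")
```

A median or an exponentially weighted mean would be equally plausible choices for the baseline; the mean is used here only because it is the simplest.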

Auto score breakdown

Accuracy: 56%, +6 vs baseline. Objective tasks solved correctly.

Reasoning robustness: 61%, +7 vs baseline. Consistency across prompt variations.

Instruction following: 48%, +4 vs baseline. Format and constraint compliance.

Hallucination risk: 52%, +5 vs baseline. Confident wrong answers on known items.

Refusal anomaly: 39%, +3 vs baseline. Unexpected refusals on safe prompts.

Latency: 46%, +4 vs baseline. p50/p95 response time drift.

Variance: 50%, +5 vs baseline. Run-to-run stability.
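
The Latency metric above is described as p50/p95 response time drift. As a rough illustration of what that could mean, the sketch below computes both percentiles for a baseline sample and for today's sample and reports the relative change. The helper names and the synthetic timings are assumptions; the page does not document how its latency percentage is derived.

```python
from statistics import quantiles


def p50_p95(samples_ms):
    """Return the (p50, p95) of a list of response times in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile,
    # index 94 is the 95th percentile.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def latency_drift(baseline_ms, today_ms):
    """Relative change of p50 and p95 from baseline to today (0.10 = 10% slower)."""
    b50, b95 = p50_p95(baseline_ms)
    t50, t95 = p50_p95(today_ms)
    return {"p50": (t50 - b50) / b50, "p95": (t95 - b95) / b95}


# Synthetic data: today's responses are uniformly 10% slower than baseline.
baseline_ms = [float(x) for x in range(1, 101)]
today_ms = [x * 1.1 for x in baseline_ms]
drift = latency_drift(baseline_ms, today_ms)
print({name: round(value, 3) for name, value in drift.items()})
```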

Eval suite

Task tier performance

Tier 0, Sanity checks: 68 (-3 today), 12 tasks
Tier 1, Factual QA: 62 (-5 today), 20 tasks
Tier 2, Reasoning + math: 57 (-7 today), 18 tasks
Tier 3, Coding: 54 (-6 today), 10 tasks
Tier 4, Instruction stress: 52 (-6 today), 10 tasks
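
One way to summarize the tier scores above in a single number is a task-weighted mean, so that tiers with more tasks count for more. The page does not state whether or how it aggregates tiers, so this is only a sketch; the `tiers` list copies the scores and task counts shown above, and `weighted_mean` is a hypothetical helper.

```python
# (tier name, score, number of tasks), copied from the tier list above.
tiers = [
    ("Tier 0 Sanity checks", 68, 12),
    ("Tier 1 Factual QA", 62, 20),
    ("Tier 2 Reasoning + math", 57, 18),
    ("Tier 3 Coding", 54, 12),
    ("Tier 4 Instruction stress", 52, 10),
]


def weighted_mean(rows):
    """Mean of the scores, weighted by each tier's task count."""
    total_tasks = sum(tasks for _, _, tasks in rows)
    return sum(score * tasks for _, score, tasks in rows) / total_tasks


total_tasks = sum(tasks for _, _, tasks in tiers)
overall = weighted_mean(tiers)
print(f"{total_tasks} tasks, task-weighted score {overall:.1f}")
```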

Community

Human reports

Top categories today

Reasoning: 8
Hallucination: 6
Instruction: 4
Latency: 4
Refusal: 3

DeepSeek V3

3h ago

Reasoning, Severity 4

Struggled with a simple algorithm refactor that it usually passes.

"Failed a known unit test case."