DeepSeek V3
3h ago
Struggled with a simple algorithm refactor that it usually passes.
DeepSeek
Daily drift snapshot against a 21-day baseline with auto + human signals.
Last run Jan 13, 2026 (2h ago)
7-day drift
AUTO DUMB INDEX
59
Sus
vs baseline +9
Why it moved
Reasoning drift
med · Tier 3 coding
Delta +7
Hallucination risk
med · Known answers
Delta +5
Variance up
med · More spread
Delta +4
Baseline window: 21 days
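
One way a "vs baseline" delta like the ones above could be derived, as a minimal sketch. It assumes one score per day and takes the baseline to be the plain mean of the 21 days before today; the dashboard does not say how it actually aggregates, and the function name `drift_vs_baseline` is hypothetical.

```python
def drift_vs_baseline(daily_scores, baseline_days=21):
    """Return today's score minus the mean of the preceding baseline window.

    Assumption: daily_scores is ordered oldest to newest, with today last.
    """
    if len(daily_scores) < 2:
        raise ValueError("need at least one baseline day plus today")
    today = daily_scores[-1]
    # Up to `baseline_days` values immediately before today.
    window = daily_scores[-(baseline_days + 1):-1]
    baseline = sum(window) / len(window)
    return today - baseline


# Illustrative numbers only: a flat baseline of 50 followed by a reading of 59.
history = [50] * 21 + [59]
delta = drift_vs_baseline(history)
print(f"vs baseline {delta:+.0f}")
```

A mean is sensitive to a single bad day inside the window; a median baseline would be a reasonable alternative if the daily scores are noisy.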
Accuracy
Objective tasks solved correctly.
56%
+6 vs baseline
Reasoning robustness
Consistency across prompt variations.
61%
+7 vs baseline
Instruction following
Format and constraint compliance.
48%
+4 vs baseline
Hallucination risk
Confident wrong answers on known items.
52%
+5 vs baseline
Refusal anomaly
Unexpected refusals on safe prompts.
39%
+3 vs baseline
Latency
p50/p95 response time drift.
46%
+4 vs baseline
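
For reference, p50/p95 drift can be sketched as the difference between today's percentiles and the baseline's. This is an illustration under assumptions: it uses the nearest-rank percentile definition and raw millisecond samples, neither of which the dashboard specifies, and the names `percentile` and `latency_drift` are hypothetical.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_drift(baseline_ms, today_ms):
    """Return today's p50 and p95 minus the baseline's, in milliseconds."""
    return {
        f"p{p}": percentile(today_ms, p) - percentile(baseline_ms, p)
        for p in (50, 95)
    }


# Illustrative samples only.
baseline = list(range(1, 101))         # 1..100 ms
today = [ms + 10 for ms in baseline]   # uniformly 10 ms slower
print(latency_drift(baseline, today))
```

Tracking p95 alongside p50 matters because tail latency can regress while the median stays flat.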
Variance
Run-to-run stability.
50%
+5 vs baseline
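
Run-to-run stability is commonly measured as the spread of scores when the same task is repeated. The sketch below assumes population standard deviation as the spread measure and compares today's repeats with baseline repeats; the dashboard does not state which statistic it uses, and `spread_change` is a hypothetical name.

```python
from statistics import pstdev


def spread_change(baseline_runs, today_runs):
    """Return today's standard deviation minus the baseline's.

    A positive result means repeated runs disagree more than they used to.
    """
    return pstdev(today_runs) - pstdev(baseline_runs)


# Illustrative scores only: same mean, wider spread today.
baseline_runs = [50, 50, 50, 50]
today_runs = [48, 52, 48, 52]
print(spread_change(baseline_runs, today_runs))
```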
Eval suite
Tier 0
Sanity checks
68
-3 today
12 tasks
Tier 1
Factual QA
62
-5 today
20 tasks
Tier 2
Reasoning + math
57
-7 today
18 tasks
Tier 3
Coding
54
-6 today
12 tasks
Tier 4
Instruction stress
52
-6 today
10 tasks
Community
Top categories today