GLM 4.7
5h ago
Ignored the JSON-only requirement twice in a row.
Open Source
Daily drift snapshot against a 21-day baseline with auto + human signals.
Last run Jan 13, 2026 (2h ago)
7-day drift
AUTO DUMB INDEX
31
Normal
vs baseline -4
Why it moved
Format jitter: low, extra tokens (delta +3)
Latency better: low, faster today (delta -2)
Accuracy steady: low, stable (delta 0)
Baseline window: 21 days
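The page does not state how a "vs baseline" delta is computed. One plausible reading, sketched below, is today's value minus the mean of the preceding 21 daily values. The function name `drift_vs_baseline`, the use of a plain mean, and the sample numbers are all assumptions made for illustration, not the tracker's actual method or data.

```python
from statistics import mean

BASELINE_DAYS = 21  # window length stated on the page


def drift_vs_baseline(history, today):
    """Return today's value minus the mean of the last BASELINE_DAYS values.

    `history` holds one value per day, oldest first, excluding today.
    Using a plain mean is an assumption; the page does not say which
    statistic the baseline uses.
    """
    window = history[-BASELINE_DAYS:]
    if not window:
        raise ValueError("need at least one baseline value")
    return today - mean(window)


if __name__ == "__main__":
    # Made-up history: 21 days at 35, then 31 today.
    history = [35] * 21
    delta = drift_vs_baseline(history, 31)
    print(f"vs baseline {delta:+.0f}")
```

A median or a trimmed mean would make the baseline less sensitive to a single bad day; which one fits depends on how noisy the daily runs are.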
Accuracy
Objective tasks solved correctly.
28%
-2 vs baseline
Reasoning robustness
Consistency across prompt variations.
26%
-1 vs baseline
Instruction following
Format and constraint compliance.
31%
+1 vs baseline
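A format-compliance check such as the "JSON-only requirement" in the community report could be automated roughly as follows. This is a minimal sketch, assuming "JSON-only" means the entire response parses as one JSON object or array with no surrounding prose or code fences. The helper names `is_json_only` and `compliance_rate` are hypothetical, and the page does not say how its own instruction-following score is derived.

```python
import json


def is_json_only(response: str) -> bool:
    """True if the whole response is a single JSON object or array.

    Leading or trailing prose, or a Markdown code fence around the JSON,
    makes json.loads fail, so those responses count as non-compliant.
    """
    try:
        parsed = json.loads(response.strip())
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))


def compliance_rate(responses) -> float:
    """Percentage of responses that pass the JSON-only check."""
    if not responses:
        raise ValueError("no responses to score")
    passed = sum(is_json_only(r) for r in responses)
    return 100.0 * passed / len(responses)


if __name__ == "__main__":
    samples = [
        '{"answer": 42}',
        'Sure! Here is the JSON: {"answer": 42}',
    ]
    print(compliance_rate(samples))
```

The check is deliberately strict: a response with correct JSON plus a friendly sentence still fails, which matches the kind of complaint quoted on the page.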
Hallucination risk
Confident wrong answers on known items.
22%
-2 vs baseline
Refusal anomaly
Unexpected refusals on safe prompts.
18%
-3 vs baseline
Latency
p50/p95 response time drift.
27%
-2 vs baseline
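For readers unfamiliar with p50/p95 drift, the sketch below computes both percentiles for a baseline sample and for today's sample, then reports the difference. It uses `statistics.quantiles` from the Python standard library. The function names, the choice of the "inclusive" method, and the sample latencies are assumptions for illustration, since the page does not describe its own calculation.

```python
from statistics import quantiles


def p50_p95(latencies_ms):
    """Return the (p50, p95) percentiles of a latency sample.

    quantiles(..., n=100) yields 99 cut points, so index 49 is the
    50th percentile and index 94 is the 95th. Needs at least 2 samples.
    """
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def latency_drift(baseline_ms, today_ms):
    """Return (p50 delta, p95 delta) as today minus baseline, in ms."""
    base_p50, base_p95 = p50_p95(baseline_ms)
    now_p50, now_p95 = p50_p95(today_ms)
    return now_p50 - base_p50, now_p95 - base_p95


if __name__ == "__main__":
    # Made-up samples: today is uniformly 10 ms slower than baseline.
    baseline = list(range(1, 102))
    today = [x + 10 for x in baseline]
    print(latency_drift(baseline, today))
```

Tracking p95 alongside p50 matters because tail latency can regress while the median stays flat.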
Variance
Run-to-run stability.
24%
-1 vs baseline
Eval suite
Tier 0: Sanity checks, 82 (+2 today), 12 tasks
Tier 1: Factual QA, 78 (+1 today), 20 tasks
Tier 2: Reasoning + math, 74 (0 today), 18 tasks
Tier 3: Coding, 70 (-1 today), 12 tasks
Tier 4: Instruction stress, 76 (+1 today), 10 tasks
Community
Top categories today