Open Source

GLM 4.7

Daily drift snapshot against a 21-day baseline, combining automated and human signals.

Last run Jul 18, 2026 (2h ago)

AUTO 31 (-4)

HUMAN280

7-day drift

AUTO DUMB INDEX

OK

31

Normal

vs baseline -4

Why it moved

Today's drivers

Format jitter

low

Extra tokens

Delta +3

Latency better

low

Faster today

Delta -2

Accuracy steady

low

Stable

Delta 0

Baseline window: 21 days
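
The "vs baseline" deltas on this page can be read as today's value minus a summary of the preceding 21 days. The page does not say how the baseline is computed, so the sketch below assumes a plain mean over the trailing window. The function name `drift_vs_baseline` and the sample history are hypothetical and not taken from the dashboard.

```python
from statistics import mean


def drift_vs_baseline(history, today, window=21):
    """Return today's score minus the mean of the last `window` daily scores.

    Assumption: the baseline is a simple trailing mean. The dashboard may
    use a median or a weighted average instead.
    """
    if not history:
        raise ValueError("history must contain at least one prior daily score")
    baseline = mean(history[-window:])
    return round(today - baseline)


# Hypothetical history: 21 days at a score of 35, then today's score of 31.
print(drift_vs_baseline([35] * 21, 31))  # -4
```

A trailing mean is the simplest choice; a median would be less sensitive to a single bad day inside the window.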

Auto score breakdown

Accuracy

Objective tasks solved correctly.

28%

-2 vs baseline

Reasoning robustness

Consistency across prompt variations.

26%

-1 vs baseline

Instruction following

Format and constraint compliance.

31%

+1 vs baseline

Hallucination risk

Confident wrong answers on known items.

22%

-2 vs baseline

Refusal anomaly

Unexpected refusals on safe prompts.

18%

-3 vs baseline

Latency

p50/p95 response time drift.

27%

-2 vs baseline

Variance

Run-to-run stability.

24%

-1 vs baseline

Eval suite

Task tier performance

Tier 0

Sanity checks

82

+2 today

12 tasks

Tier 1

Factual QA

78

+1 today

20 tasks

Tier 2

Reasoning + math

74

0 today

18 tasks

Tier 3

Coding

70

-1 today

12 tasks

Tier 4

Instruction stress

76

+1 today

10 tasks

Community

Human reports

Top categories today

Format: 4

Latency: 3

Instruction: 2

Reasoning: 2

Hallucination: 1

GLM 4.7

5h ago

Format (Severity 2)

Ignored the JSON-only requirement twice in a row.

"Added commentary outside JSON."
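
A report like this one can be reproduced with a strict parse of the whole response. The following is a minimal sketch, assuming "JSON-only" means the entire response must parse as a single JSON document with nothing but whitespace around it. The helper name `is_json_only` is hypothetical and is not part of the dashboard's eval suite.

```python
import json


def is_json_only(response: str) -> bool:
    """Return True if the whole response parses as one JSON document.

    json.loads tolerates surrounding whitespace but raises JSONDecodeError
    when any other text comes before or after the JSON value.
    """
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


print(is_json_only('{"ok": true}'))                     # True
print(is_json_only('{"ok": true}\nHope that helps.'))   # False
```

Note that a bare scalar such as `42` is also valid JSON, so a stricter check might additionally require the parsed value to be a dict or list.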