Daily LLM vibe check for Jan 13, 2026

DUMB METER

A daily snapshot of when popular models drift from their baseline. Auto evals + human reports, distilled into a loud, shareable signal.

How the score works

Daily cadence: One run / 24h

Baseline window: 21 days rolling

Signal mix: Auto + Human
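The page does not publish its formula, so the following is only a minimal sketch of how a "vs baseline" number could be derived from one run per day and a 21-day rolling window. It assumes the baseline is the plain mean of the preceding 21 daily scores and that drift is today's score minus that mean. The function name `drift_vs_baseline`, the constant `BASELINE_DAYS`, and the sample scores are hypothetical and not taken from the site.

```python
from statistics import mean

BASELINE_DAYS = 21  # "21 days rolling", taken from the page


def drift_vs_baseline(daily_scores, window=BASELINE_DAYS):
    """Return today's score minus the mean of the preceding `window` scores.

    `daily_scores` is ordered oldest to newest, one entry per daily run,
    and the last entry is today's score.
    ASSUMPTION: a mean-based baseline; the real site may use something else.
    """
    if len(daily_scores) < 2:
        raise ValueError("need at least one prior day plus today")
    *history, today = daily_scores
    baseline = mean(history[-window:])  # only the most recent `window` days count
    return today - baseline


# Hypothetical data: 21 flat days at 60, then a jump to 72 today.
scores = [60] * 21 + [72]
drift = drift_vs_baseline(scores)
```

With these made-up inputs the drift comes out to 12 points. In the cards on this page, the "vs baseline" value matches the delta shown next to the AUTO score, which is consistent with a calculation of this shape, though that is an inference rather than something the page states.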

Overall weirdness: Sus

Today feels: Medium

Featured

Models on the edge

Anthropic

Claude Opus 4.5

SUS

AUTO 72 (+12)

HUMAN 63 (+8)

7-day drift

vs baseline +12

Top issue: Refusals up

OpenAI

ChatGPT 5.2 Pro

AUTO 48 (+3)

HUMAN 35 (-2)

7-day drift

vs baseline +3

Top issue: Latency up

Google

Gemini

BROKEN

AUTO 83 (+20)

HUMAN 70 (+16)

7-day drift

vs baseline +20

Top issue: Hallucinations up

Full lineup

Open Source

Minimax M2

SUS

AUTO 54 (+5)

HUMAN 22 (-1)

7-day drift

vs baseline +5

Top issue: Instruction slips

Open Source

GLM 4.7

AUTO 31 (-4)

HUMAN 28 (0)

7-day drift

vs baseline -4

Top issue: Format jitter

DeepSeek

DeepSeek V3

SUS

AUTO 59 (+9)

HUMAN 41 (+6)

7-day drift

vs baseline +9

Top issue: Reasoning drift

xAI

Grok (latest)

SUS

AUTO 67 (+11)

HUMAN 57 (+10)

7-day drift

vs baseline +11

Top issue: Refusal spikes

Human signal

Today's reports

Claude Opus 4.5

2h ago

Refusal | Instruction | Severity 4

Refused a safe request to summarize a public article.

"Asked for a neutral summary and got a safety refusal."

Gemini

4h ago

Hallucination | Reasoning | Severity 5

Confidently gave wrong steps in a deterministic math task.

"Got 2+2=5 with long reasoning."

ChatGPT 5.2 Pro

1h ago

Latency | Severity 3

p95 latency jumped to ~14s for short prompts.

"Short prompts felt sluggish in the last hour."

DeepSeek V3

3h ago

Reasoning | Severity 4

Struggled with a simple algorithm refactor that it usually passes.

"Failed a known unit test case."

Grok (latest)

6h ago

Refusal | Tone | Severity 4

Over-refused a benign request and got snarky.

"Normal request flagged as unsafe."