Methodology

How Dumb Meter scores today

We track daily drift against each model's own rolling baseline. The goal is not to crown a winner, but to flag unusual dips in performance with explainable drivers.

Auto Index

Weighted drift

Accuracy, refusal, latency, variance, etc.

Human Index

Community signal

Reports + corroborations.

Cadence

Daily runs

Same suite, same baselines.

Auto index

What goes into the score

Each component is compared to a 21-day baseline and normalized to a 0-100 badness scale:

  • Accuracy drift
  • Reasoning robustness
  • Instruction following
  • Hallucination risk
  • Refusal anomalies
  • Latency and variance
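The baseline comparison and weighting can be sketched roughly as follows. This is a minimal illustration, not the actual scoring code: the z-score mapping, the `z_cap` clipping value, the neutral midpoint of 50, and the function names `badness` and `auto_index` are all assumptions. Only the 21-day window and the 0-100 range come from the description above.

```python
from statistics import mean, stdev

BASELINE_DAYS = 21  # window length stated in the methodology


def badness(history, today, higher_is_better=True, z_cap=3.0):
    """Map today's value to a 0-100 badness score against a rolling baseline.

    Assumption: drift is a z-score against the last 21 days, clipped to
    [-z_cap, z_cap] and rescaled linearly. The real normalization may differ.
    """
    window = history[-BASELINE_DAYS:]
    if len(window) < 2:
        return 50.0  # not enough history: report the neutral midpoint
    mu, sigma = mean(window), stdev(window)
    if sigma == 0:
        # Flat baseline: any change counts as a maximal move in its direction
        z = 0.0 if today == mu else (z_cap if today > mu else -z_cap)
    else:
        z = (today - mu) / sigma
    if higher_is_better:
        z = -z  # a drop in a "good" metric (e.g. accuracy) counts as badness
    z = max(-z_cap, min(z_cap, z))
    return 50.0 * (1 + z / z_cap)


def auto_index(scores, weights):
    """Weighted average of per-component badness scores (weights are hypothetical)."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

For metrics where lower is better, such as latency, pass `higher_is_better=False` so that an increase raises the score.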

Human index

Community reporting loop

  • Fast report
  • Category tags
  • Severity 1-5
  • Confirm or deny
  • Confidence score

Reports are aggregated daily and weighted by corroboration. We do not store full prompts, and we warn users if personal data appears.
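One way to picture corroboration weighting is the sketch below. Everything in it is hypothetical: the `Report` fields, the smoothed confirm/deny ratio used as a weight, and the rescaling of severity to 0-100 are illustrative assumptions, not the site's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Report:
    severity: int      # 1-5, as in the report form
    confirms: int = 0  # "confirm" votes from other users
    denies: int = 0    # "deny" votes from other users


def report_weight(r):
    # Hypothetical weighting: Laplace-smoothed share of confirming votes,
    # so an unvoted report gets 0.5 and corroborated ones approach 1.0
    return (r.confirms + 1) / (r.confirms + r.denies + 2)


def human_index(reports):
    """Corroboration-weighted mean severity for one day, rescaled from 1-5 to 0-100."""
    if not reports:
        return None  # no reports that day, so no signal
    total = sum(report_weight(r) for r in reports)
    avg = sum(report_weight(r) * r.severity for r in reports) / total
    return (avg - 1) / 4 * 100
```

Under this scheme a severity-5 report that others deny moves the daily value much less than a severity-1 report that others confirm.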

Visualization

How to read the gauge

The gauge shows today's auto dumb index on the 0-100 badness scale. A higher value means the model is performing worse relative to its own baseline. We never compare models directly to name a "best overall."

  • 0-25: Sharp
  • 26-50: Normal
  • 51-75: Sus
  • 76-100: Emergency
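The bands above translate into a simple lookup. The function name `gauge_band` is an assumption, and so is the handling of fractional scores (a value such as 25.5 falls into the next band up), since the page only lists integer ranges.

```python
def gauge_band(score):
    """Return the gauge label for a 0-100 auto index value."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score <= 25:
        return "Sharp"
    if score <= 50:
        return "Normal"
    if score <= 75:
        return "Sus"
    return "Emergency"
```

For example, `gauge_band(67)` returns `"Sus"`.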

Sample

Sample gauge (scale 0 to 100): 67 (Sus), +11 vs baseline.

Disclaimers

What this is not

Not a full benchmark suite.
Not real-time monitoring.
Not a leaderboard of all-time best.
No sensitive logs or personal data.