Methodology
How Dumb Meter scores today
We track daily drift against each model's own rolling baseline. The goal is not to crown a winner, but to flag unusual dips in performance with explainable drivers.
Auto Index
Weighted drift
Accuracy, refusal, latency, variance, etc.
Human Index
Community signal
Reports + corroborations.
Cadence
Daily runs
Same suite, same baselines.
Auto Index
What goes into the score
Accuracy drift
Compared to a 21-day baseline, normalized to a 0-100 badness scale.
Reasoning robustness
Compared to a 21-day baseline, normalized to a 0-100 badness scale.
Instruction following
Compared to a 21-day baseline, normalized to a 0-100 badness scale.
Hallucination risk
Compared to a 21-day baseline, normalized to a 0-100 badness scale.
Refusal anomalies
Compared to a 21-day baseline, normalized to a 0-100 badness scale.
Latency and variance
Compared to a 21-day baseline, normalized to a 0-100 badness scale.
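As a rough illustration of how one component could be normalized and then combined, here is a minimal Python sketch. Only the 21-day baseline and the 0-100 badness scale come from the text above. The z-score mapping, the clamp at 3 standard deviations, the choice to place the baseline at 50, the equal default weights, and the names badness_score and auto_index are all assumptions made for the example, not Dumb Meter's actual formula.

```python
from statistics import mean, pstdev

BASELINE_DAYS = 21  # window length stated in the methodology


def badness_score(history, today, higher_is_worse=True, max_z=3.0):
    """Map today's metric onto a 0-100 badness scale against its own baseline.

    history: the previous 21 daily values of one metric for one model.
    max_z:   assumed clamp; deviations beyond this many standard
             deviations saturate at 0 or 100.
    """
    if len(history) != BASELINE_DAYS:
        raise ValueError(f"expected {BASELINE_DAYS} baseline values")
    baseline = mean(history)
    spread = pstdev(history) or 1e-9  # guard against a perfectly flat history
    z = (today - baseline) / spread
    if not higher_is_worse:
        z = -z  # for metrics like accuracy, a drop is the bad direction
    z = max(-max_z, min(max_z, z))
    return 50.0 * (1.0 + z / max_z)  # assumed: baseline maps to 50


def auto_index(scores, weights=None):
    """Weighted average of component badness scores (equal weights assumed)."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

For a latency-style metric, badness_score(history, today) treats a rise as bad. For an accuracy-style metric, passing higher_is_worse=False flips the direction so that a drop raises the score.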
Human Index
Community reporting loop
Reports are aggregated daily and weighted by corroboration. We do not store full prompts, and we warn users if a report appears to contain personal data.
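The daily aggregation could look something like the sketch below. The Report fields, the daily_signal name, and the weighting rule (each report counts once, plus once per corroboration) are hypothetical. The text above says only that reports are aggregated daily and weighted by corroboration. Note that the record carries no prompt text, in line with not storing full prompts.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Report:
    """Hypothetical report record; no prompt text is kept."""
    model: str
    day: date
    corroborations: int  # how many other users confirmed this report


def daily_signal(reports):
    """Sum corroboration-weighted reports per (model, day).

    Assumed weighting: a report counts 1, plus 1 per corroboration.
    """
    totals = defaultdict(float)
    for report in reports:
        totals[(report.model, report.day)] += 1.0 + report.corroborations
    return dict(totals)
```

Under this assumed rule, one report with two corroborations contributes as much as three uncorroborated reports.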
Visualization
How to read the gauge
The gauge shows today's auto dumb index. A higher reading means the model has drifted further from its own baseline in the bad direction, not that it is worse than another model. We never compare models directly for "best overall."
- 0-25: Sharp
- 26-50: Normal
- 51-75: Sus
- 76-100: Emergency
Sample: 67 (Sus), +11 vs baseline
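The band legend can be expressed as a small lookup. The thresholds and labels below are taken from the list above. The function name gauge_band is made up for the example, and treating each band as "up to and including its upper bound" for fractional readings is an assumption, since the legend only lists whole numbers.

```python
# Upper bound of each band, taken from the gauge legend.
BANDS = [(25, "Sharp"), (50, "Normal"), (75, "Sus"), (100, "Emergency")]


def gauge_band(score):
    """Return the legend label for a 0-100 gauge reading."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for upper, label in BANDS:
        if score <= upper:
            return label
```

With this lookup, the sample reading of 67 falls in the Sus band, matching the sample gauge.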
Disclaimers