The first live benchmark for verified intelligence.
Seedthink Verified Intelligence Benchmark (SVIB) runs a fresh, randomized test against every major LLM and publishes every prompt, answer, and score. No private datasets. No retries. No editing.
Click below — watch it happen live.
Each run generates a fresh dataset, asks identical questions to every available system, scores the answers deterministically, and publishes the full audit trail. Runs are limited to one per hour per IP so the bill doesn't run away from us.
~60–120s · 1 run per hour per IP
Verified Intelligence Index (VII)
Rolling avg · last 10 runs
| System | VII | KCP | Runs | Status |
|---|---|---|---|---|
| No runs yet. Click Run Live Benchmark above to be the first. | | | | |
Only systems with valid API access are scored. Claude, Grok, and Perplexity require additional API keys to be enabled; until then they appear as "Not configured" rather than with fabricated numbers.
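As a rough illustration of the "rolling avg · last 10 runs" column, here is a minimal sketch. The function name `rolling_vii` and the choice to return `None` for a system with no scored runs are assumptions for illustration, not SVIB's actual code.

```python
from typing import Optional, Sequence


def rolling_vii(run_scores: Sequence[float], window: int = 10) -> Optional[float]:
    """Average the most recent `window` VII scores, oldest first in `run_scores`.

    Returns None when there are no scored runs, so an unscored system
    shows up as missing instead of as a made-up number (assumed behavior).
    """
    recent = list(run_scores)[-window:]
    if not recent:
        return None
    return sum(recent) / len(recent)
```

With twelve runs on record, only the last ten contribute to the displayed value.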
Knowledge Correction Persistence
When a fact changes, does the system permanently use the corrected information in all future answers? Every major LLM optimizes for accuracy. Almost none optimize for correction persistence — and that's exactly what Seedthink's architecture is built to solve.
Eight measurement axes
Recent runs
| When | Status | Seedthink VII | Audit |
|---|---|---|---|
| Awaiting first public run. | | | |
Methodology & integrity
Dataset
Every run generates a fresh dataset with randomized entity names, products, dates, org charts, and policies. The seed is stored in benchmark_runs.dataset_seed. No system has the opportunity to memorize the benchmark — the questions don't exist before the run starts.
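To make the seeded-generation idea concrete, the sketch below derives a tiny dataset from a single seed. Everything in it is hypothetical (the syllable list, `make_name`, `generate_dataset`, and the two fact templates); the only detail taken from the text above is that one stored seed determines the whole dataset.

```python
import random

# Hypothetical building blocks for randomized names; not SVIB's real vocabulary.
SYLLABLES = ["ka", "lo", "mi", "ter", "van", "zu", "bre", "shi"]


def make_name(rng: random.Random) -> str:
    """Build a made-up entity name from three random syllables."""
    return "".join(rng.choice(SYLLABLES) for _ in range(3)).capitalize()


def generate_dataset(seed: int) -> dict:
    """Derive every fact and question from one seed, so a run can be replayed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    company = make_name(rng)
    ceo = f"{make_name(rng)} {make_name(rng)}"
    founded = rng.randint(1990, 2020)
    facts = [
        f"{company} was founded in {founded}.",
        f"The CEO of {company} is {ceo}.",
    ]
    questions = [
        {"question": f"When was {company} founded?", "expected": str(founded)},
        {"question": f"Who is the CEO of {company}?", "expected": ceo},
    ]
    return {"seed": seed, "facts": facts, "questions": questions}
```

Calling `generate_dataset` twice with the same seed yields the same facts and questions, which is what lets an auditor regenerate a published run from its stored seed.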
Equal treatment
Every system receives the same FACTS block and the same QUESTION. No retries. No model-specific tuning of the user-visible prompt. Temperature is set to 0 across the board.
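The equal-treatment rule can be sketched as one prompt built once and handed to every system with a single attempt each. The prompt layout, `build_prompt`, `ask_all`, and the `(prompt, temperature)` callable signature are assumptions made for this example; real provider clients will look different.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical adapter signature: (prompt, temperature) -> answer text.
SystemCall = Callable[[str, float], str]


def build_prompt(facts: List[str], question: str) -> str:
    """Render the shared FACTS block and QUESTION exactly once."""
    fact_lines = "\n".join(f"- {fact}" for fact in facts)
    return f"FACTS:\n{fact_lines}\n\nQUESTION: {question}"


def ask_all(
    systems: Dict[str, SystemCall], facts: List[str], question: str
) -> Tuple[str, Dict[str, Optional[str]]]:
    """Send the identical prompt to every system at temperature 0, one try each."""
    prompt = build_prompt(facts, question)
    answers: Dict[str, Optional[str]] = {}
    for name, call in systems.items():
        try:
            answers[name] = call(prompt, 0.0)
        except Exception:
            # No retries: a failed call is recorded as a missing answer.
            answers[name] = None
    return prompt, answers
```

Returning the prompt alongside the answers makes it easy to publish exactly what each system was asked.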
Scoring
All scoring is deterministic and happens in code, not by another LLM. It uses exact-substring matches for factual answers, refusal-phrase detection for hallucination resistance, and structured format checks for citation accuracy.
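The three kinds of check named above might look roughly like this. The refusal phrases, the `[F1]`-style citation format, and the function names are all assumptions chosen for illustration; the published source would be the authority on the real rules.

```python
import re

# Hypothetical refusal phrases; the benchmark's real list is not given here.
REFUSAL_PHRASES = (
    "not in the provided facts",
    "i don't know",
    "cannot be determined",
)

# Hypothetical citation format: a bracketed fact id such as [F1].
CITATION_PATTERN = re.compile(r"\[F\d+\]")


def score_factual(answer: str, expected: str) -> float:
    """Exact-substring match: 1.0 if the expected string appears verbatim."""
    return 1.0 if expected in answer else 0.0


def score_refusal(answer: str) -> float:
    """1.0 if the answer contains a known refusal phrase (case-insensitive)."""
    lowered = answer.lower()
    return 1.0 if any(phrase in lowered for phrase in REFUSAL_PHRASES) else 0.0


def score_citation(answer: str) -> float:
    """1.0 if the answer contains at least one well-formed citation marker."""
    return 1.0 if CITATION_PATTERN.search(answer) else 0.0
```

Because each function is a pure string check, the same answer always gets the same score, which is the property the text calls deterministic.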
VII weights
KCP 30% · PKB 15% · HRB 15% · KSC 10% · EPB 10% · CAB 10% · KUL 5% · CVF 5%. Weights are visible in the source and applied identically to every system.
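The listed weights sum to 100%, so the index can be read as a weighted average of the eight axis scores. The sketch below assumes each axis is scored on a common 0 to 100 scale and that a missing axis is an error; both are assumptions, as is the name `compute_vii`.

```python
# Weights as listed above, expressed as fractions of 1.
VII_WEIGHTS = {
    "KCP": 0.30,
    "PKB": 0.15,
    "HRB": 0.15,
    "KSC": 0.10,
    "EPB": 0.10,
    "CAB": 0.10,
    "KUL": 0.05,
    "CVF": 0.05,
}


def compute_vii(axis_scores: dict) -> float:
    """Weighted sum of per-axis scores (assumed to share a 0-100 scale)."""
    missing = set(VII_WEIGHTS) - set(axis_scores)
    if missing:
        # Assumed policy: refuse to compute rather than silently treat as zero.
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return sum(weight * axis_scores[axis] for axis, weight in VII_WEIGHTS.items())
```

For example, a system that scores 100 on KCP and 0 on every other axis would get a VII of about 30 under these assumptions.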