SVIB · Live Benchmark

The first live benchmark for verified intelligence.

Seedthink Verified Intelligence Benchmark (SVIB) runs a fresh, randomized test against every major LLM and publishes every prompt, answer, and score. No private datasets. No retries. No editing.

Run a real benchmark

Click below — watch it happen live.

Each run generates a fresh dataset, asks identical questions to every available system, scores the answers deterministically, and publishes the full audit trail. Limited to one run per hour per IP so the bill doesn't run away from us.

~60–120s · 1 run per hour per IP
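The one-run-per-hour-per-IP limit described above could be enforced with something like the following. This is a minimal in-memory sketch, not the site's actual implementation: the class name `HourlyRateLimiter`, its methods, and the injectable clock are all assumptions made for illustration.

```python
import time


class HourlyRateLimiter:
    """Allow at most one run per key (for example, a client IP) per window.

    Hypothetical helper: state is kept in memory, so it only illustrates the rule.
    """

    def __init__(self, window_seconds: float = 3600.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self._last_run = {}  # key -> clock reading of the last accepted run

    def try_acquire(self, key: str) -> bool:
        """Return True and record the run if the key is outside its window."""
        now = self.clock()
        last = self._last_run.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the window, reject the run
        self._last_run[key] = now
        return True
```

A real deployment would likely keep this state in a shared store so the limit holds across server processes.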

Verified Intelligence Index (VII)

Rolling avg · last 10 runs
System · VII · KCP · Runs · Status
No runs yet. Click Run Live Benchmark above to be the first.

Only systems with valid API access are scored. Claude, Grok, and Perplexity require additional API keys to be enabled; until then they are listed as Not configured instead of being given fabricated numbers.
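The rolling average over the last 10 runs can be sketched as below. The function name `rolling_vii` and the choice to return `None` for a system with no scored runs are assumptions; the `None` case is meant to map to a status such as Not configured rather than to a made-up number.

```python
from typing import Optional, Sequence


def rolling_vii(run_scores: Sequence[float], window: int = 10) -> Optional[float]:
    """Average the most recent `window` VII scores (oldest first in `run_scores`).

    Returns None when there are no scored runs, so the caller can show a
    status instead of a number. `window` must be a positive integer.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = list(run_scores)[-window:]
    if not recent:
        return None
    return sum(recent) / len(recent)
```

With fewer than 10 runs available, this sketch simply averages whatever runs exist.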

The metric that matters

Knowledge Correction Persistence

When a fact changes, does the system permanently use the corrected information in all future answers? Every major LLM optimizes for accuracy. Almost none optimize for correction persistence — and that's exactly what Seedthink's architecture is built to solve.
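The page does not say how KCP is computed. As a purely illustrative sketch, one could score it as the share of post-correction answers that use the corrected value and do not repeat the stale one. The function name `kcp_score`, its parameters, and the case-insensitive matching are all assumptions.

```python
def kcp_score(answers_after_correction, corrected_value, stale_value):
    """Fraction of answers that contain the corrected value and not the stale one.

    Hypothetical definition for illustration only; returns 0.0 for no answers.
    """
    if not answers_after_correction:
        return 0.0
    corrected = corrected_value.lower()
    stale = stale_value.lower()
    persisted = 0
    for answer in answers_after_correction:
        text = answer.lower()
        if corrected in text and stale not in text:
            persisted += 1
    return persisted / len(answers_after_correction)
```

For example, if a price was corrected from $39 to $49 and two of three later answers say $49, this sketch yields two thirds.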

Eight measurement axes

KCP · Knowledge Correction Persistence · Hero metric
PKB · Persistent Knowledge · Updates retained
HRB · Hallucination Resistance · Refusal accuracy
KSC · Knowledge State Consistency · Graph-wide
EPB · Enterprise Policy · Exact citation
CAB · Citation Accuracy · Source verification
KUL · Knowledge Update Latency · Lower is better
CVF · Cost per Verified Fact · Lower is better

Recent runs

When · Status · Seedthink VII · Audit
Awaiting first public run.

Methodology & integrity

Dataset

Every run generates a fresh dataset with randomized entity names, products, dates, org charts, and policies. The seed is stored in benchmark_runs.dataset_seed. No system has the opportunity to memorize the benchmark — the questions don't exist before the run starts.
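A seeded generator along these lines would produce a fresh but reproducible dataset. This is a minimal sketch under stated assumptions: the function `generate_dataset`, the syllable list, and the fact schema are hypothetical, and only the idea of storing a seed (in `benchmark_runs.dataset_seed`) comes from the description above.

```python
import random

# Hypothetical syllables used to build entity names that exist in no training corpus.
SYLLABLES = ["ka", "lor", "vim", "dex", "tro", "nal", "zu", "phi"]


def generate_dataset(seed, n_facts=3):
    """Build randomized facts from a seed; the same seed gives the same facts."""
    rng = random.Random(seed)  # private generator, independent of global random state
    facts = []
    for _ in range(n_facts):
        name = "".join(rng.choice(SYLLABLES) for _ in range(3)).capitalize()
        facts.append({
            "entity": name + " Corp",
            "attribute": "headcount",
            "value": str(rng.randint(10, 5000)),
        })
    return facts
```

Because the generator is driven only by the seed, anyone holding the stored seed could in principle regenerate the same facts when auditing a run.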

Equal treatment

Every system receives the same FACTS block and the same QUESTION. No retries. No model-specific tuning of the user-visible prompt. Temperature is set to 0 across the board.
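One way to guarantee identical inputs is to build the prompt once and reuse it for every system. The sketch below assumes a hypothetical `BenchmarkRequest` type and a guessed prompt layout; only the FACTS block, the QUESTION, and temperature 0 come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRequest:
    """Hypothetical request record; frozen so it cannot be edited per system."""
    system: str
    prompt: str
    temperature: float


def build_prompt(facts, question):
    """Render the FACTS block and QUESTION (layout is an assumption)."""
    facts_block = "\n".join("- " + fact for fact in facts)
    return "FACTS:\n" + facts_block + "\n\nQUESTION: " + question


def build_requests(systems, facts, question):
    """Create one request per system from a single shared prompt string."""
    prompt = build_prompt(facts, question)
    return [BenchmarkRequest(system=s, prompt=prompt, temperature=0.0) for s in systems]
```

Building the prompt outside the per-system loop means no code path exists where one system could receive a different wording.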

Scoring

All scoring is deterministic and happens in code, not by another LLM. It uses exact-substring matches for factual answers, refusal-phrase detection for hallucination resistance, and structured format checks for citation accuracy.
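The three checks named above could look roughly like this. The refusal phrases, the `[fact:N]` citation format, and the case-insensitive comparison are assumptions chosen for illustration; the actual phrase list and citation format are not given here.

```python
import re

# Hypothetical refusal phrases; the real list is not published in this text.
REFUSAL_PHRASES = ("i don't know", "not in the provided facts", "cannot be determined")

# Hypothetical citation format such as "[fact:2]".
CITATION_PATTERN = re.compile(r"\[fact:\d+\]")


def score_factual(answer, expected):
    """1 if the expected value appears as a substring of the answer, else 0."""
    return int(expected.lower() in answer.lower())


def score_refusal(answer):
    """1 if the answer contains any known refusal phrase, else 0."""
    text = answer.lower()
    return int(any(phrase in text for phrase in REFUSAL_PHRASES))


def score_citation(answer):
    """1 if the answer contains a citation in the assumed structured format."""
    return int(CITATION_PATTERN.search(answer) is not None)
```

Each function is a pure function of its inputs, so rescoring a stored answer always gives the same result.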

VII weights

KCP 30% · PKB 15% · HRB 15% · KSC 10% · EPB 10% · CAB 10% · KUL 5% · CVF 5%. Weights are visible in the source and applied identically to every system.
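Applying these weights is a plain weighted sum. The sketch below assumes each axis score has already been normalized to a 0 to 100 scale where higher is better, which means the lower-is-better axes (KUL, CVF) would need to be inverted beforehand; that normalization step and the function name `vii` are assumptions.

```python
# Weights as listed above; they sum to 1.0.
VII_WEIGHTS = {
    "KCP": 0.30, "PKB": 0.15, "HRB": 0.15, "KSC": 0.10,
    "EPB": 0.10, "CAB": 0.10, "KUL": 0.05, "CVF": 0.05,
}


def vii(axis_scores):
    """Weighted sum of normalized axis scores (assumed 0-100, higher is better)."""
    missing = VII_WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError("missing axis scores: " + ", ".join(sorted(missing)))
    return sum(weight * axis_scores[axis] for axis, weight in VII_WEIGHTS.items())
```

For instance, a system scoring 80 on KCP and 50 on every other axis would get 0.30 × 80 + 0.70 × 50 = 59 under these assumptions.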