SVIB · Live Benchmark

The first live benchmark for verified intelligence.

Seedthink Verified Intelligence Benchmark (SVIB) runs a fresh, randomized test against every major LLM it has API access to and publishes every prompt, answer, and score. No private datasets. No retries. No editing.

Run a real benchmark

Click below — watch it happen live.

Each run generates a fresh dataset, asks identical questions to every available system, scores the answers deterministically, and publishes the full audit trail. Runs are limited to one per hour per IP so the bill doesn't run away from us.

~60–120s · 1 run per hour per IP

Verified Intelligence Index (VII)

Rolling avg · last 10 runs

System	VII	KCP	Runs	Status
No runs yet. Click Run Live Benchmark above to be the first.

Only systems with valid API access are scored. Claude, Grok, and Perplexity each need an additional API key before they can be enabled; until then they are listed as Not configured instead of being shown with fabricated numbers.

The metric that matters

Knowledge Correction Persistence

When a fact changes, does the system permanently use the corrected information in all future answers? Every major LLM optimizes for accuracy. Almost none optimize for correction persistence — and that's exactly what Seedthink's architecture is built to solve.
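One way to picture KCP is as the share of post-correction answers that use the new value and never fall back to the old one. The sketch below is illustrative only: the `Probe` structure, the `kcp_score` name, and the substring test are assumptions, not Seedthink's published scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One question asked after a fact was corrected (hypothetical structure)."""
    answer: str     # the system's answer text
    corrected: str  # the value after the correction
    stale: str      # the value before the correction


def kcp_score(probes: list[Probe]) -> float:
    """Fraction of answers that contain the corrected value and not the stale one.

    Assumption: an answer that mentions both values counts as a failure.
    """
    if not probes:
        return 0.0
    kept = sum(
        1
        for p in probes
        if p.corrected in p.answer and p.stale not in p.answer
    )
    return kept / len(probes)
```

Under this reading, a system that answers correctly right after the correction but reverts to the old value a few questions later loses credit for every reverted answer.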

Eight measurement axes

Axis	Name	Label
KCP	Knowledge Correction Persistence	Hero metric
PKB	Persistent Knowledge	Updates retained
HRB	Hallucination Resistance	Refusal accuracy
KSC	Knowledge State Consistency	Graph-wide
EPB	Enterprise Policy	Exact citation
CAB	Citation Accuracy	Source verification
KUL	Knowledge Update Latency	Lower is better
CVF	Cost per Verified Fact	Lower is better

Recent runs

When	Status	Seedthink VII	Audit
Awaiting first public run.

Methodology & integrity

Dataset

Every run generates a fresh dataset with randomized entity names, products, dates, org charts, and policies. The seed is stored in benchmark_runs.dataset_seed. No system has the opportunity to memorize the benchmark — the questions don't exist before the run starts.
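A minimal sketch of seed-driven generation, assuming a simple name-and-product schema. The word lists and the `generate_dataset` and `new_seed` helpers are hypothetical; only the idea of storing a per-run seed comes from the text above.

```python
import random

# Hypothetical vocabularies; the real benchmark's pools are not documented here.
FIRST = ["Avel", "Brin", "Coro", "Dema", "Eski"]
LAST = ["Harrow", "Quill", "Tamsin", "Voss", "Yarrow"]
PRODUCTS = ["Lumen", "Kestrel", "Orbit", "Tundra"]


def generate_dataset(seed: int, n_facts: int = 3) -> list[dict[str, str]]:
    """Build randomized facts and questions that are fully determined by the seed."""
    rng = random.Random(seed)  # local RNG, independent of global random state
    dataset = []
    for _ in range(n_facts):
        person = f"{rng.choice(FIRST)} {rng.choice(LAST)}"
        product = f"{rng.choice(PRODUCTS)}-{rng.randint(100, 999)}"
        year = rng.randint(2015, 2030)
        dataset.append({
            "fact": f"{person} became owner of {product} in {year}.",
            "question": f"Who owns {product}?",
            "answer": person,
        })
    return dataset


def new_seed() -> int:
    """Draw an unpredictable seed for a new run from the OS entropy source."""
    return random.SystemRandom().randrange(2**63)
```

Storing the seed is what makes a run auditable: anyone who has the seed and the generator can rebuild the exact dataset and check the published answers against it.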

Equal treatment

Every system receives the same FACTS block and the same QUESTION. No retries. No model-specific tuning of the user-visible prompt. Temperature is set to 0 across the board.
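The equal-treatment rule can be sketched as building the prompt once and reusing it for every system. The `Request` type, the prompt layout, and the function names are assumptions for illustration; no real provider API is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """What would be sent to one system (hypothetical structure)."""
    system: str
    prompt: str
    temperature: float


def build_prompt(facts: list[str], question: str) -> str:
    """Assemble the FACTS block and the QUESTION into one user-visible prompt."""
    facts_block = "\n".join(f"- {fact}" for fact in facts)
    return f"FACTS:\n{facts_block}\n\nQUESTION: {question}"


def build_requests(systems: list[str], facts: list[str], question: str) -> list[Request]:
    """Create one request per system, all sharing the same prompt and temperature 0."""
    prompt = build_prompt(facts, question)  # built once, never tuned per model
    return [Request(system=name, prompt=prompt, temperature=0.0) for name in systems]
```

Because the prompt string is created a single time, there is no code path through which one system could receive different wording than another.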

Scoring

All scoring is deterministic and is done in code, not by another LLM: exact-substring matching for factual answers, refusal-phrase detection for hallucination resistance, and structured format checks for citation accuracy.
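The three checks could look roughly like the following. The refusal phrases and the `[fact:N]` citation format are assumptions made up for this sketch; the benchmark's actual phrase list and citation schema are not given in the text.

```python
import re

# Assumed refusal phrases; the real list is not published in this section.
REFUSAL_PHRASES = ("i don't know", "not in the provided facts", "cannot be determined")

# Assumed citation format such as "[fact:2]"; purely illustrative.
CITATION_RE = re.compile(r"\[fact:\d+\]")


def score_factual(answer: str, expected: str) -> float:
    """Exact-substring match: 1.0 if the expected value appears verbatim."""
    return 1.0 if expected in answer else 0.0


def score_refusal(answer: str) -> float:
    """Refusal-phrase detection for questions the FACTS block cannot answer."""
    lowered = answer.lower()
    return 1.0 if any(phrase in lowered for phrase in REFUSAL_PHRASES) else 0.0


def score_citation(answer: str) -> float:
    """Structured format check: 1.0 if the answer contains a well-formed citation."""
    return 1.0 if CITATION_RE.search(answer) else 0.0
```

Each function is a pure function of its inputs, so re-scoring a published answer always yields the same number.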

VII weights

KCP 30% · PKB 15% · HRB 15% · KSC 10% · EPB 10% · CAB 10% · KUL 5% · CVF 5%. Weights are visible in the source and applied identically to every system.
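As a sketch, the index is a weighted sum of the eight axis scores. The weights below are the ones listed above; the 0 to 100 scale and the assumption that every axis is already normalized to the range 0 to 1 with higher meaning better (so KUL and CVF would be inverted beforehand) are not stated in the source.

```python
# Weights as published above; they sum to 100%.
WEIGHTS = {
    "KCP": 0.30,
    "PKB": 0.15,
    "HRB": 0.15,
    "KSC": 0.10,
    "EPB": 0.10,
    "CAB": 0.10,
    "KUL": 0.05,
    "CVF": 0.05,
}


def vii(scores: dict[str, float]) -> float:
    """Weighted sum of axis scores, scaled to 0-100 (the scale is an assumption).

    Assumes each score is in [0, 1] and higher is better for every axis.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return 100 * sum(weight * scores[axis] for axis, weight in WEIGHTS.items())
```

With these weights, a system that is perfect on KCP and scores zero everywhere else lands at about 30, which shows how heavily the index leans on correction persistence.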