Measuring Verified Intelligence

Five interlocking metrics — VII, KRS, KCI, VIMM, and CIB — that turn "is this AI good?" into something you can measure, move, and improve over time.

June 15, 2026 · Seedthink Labs

Why a new measurement stack

Most AI benchmarks score a frozen model on a frozen test set at a single point in time. That tells you very little about what matters in production: whether the system's knowledge is trustworthy today, whether it agrees with itself, and whether it is getting better or quietly drifting.

Seedthink is not a frozen model. It is a continuously learning system that separates a lightweight reasoner from a verified knowledge graph. To make sense of how that system improves, we measure it along five complementary axes, each targeting a specific failure mode of conventional LLMs.

VII — Verified Intelligence Index

VII is the composite headline score. It rolls reasoning quality, knowledge reliability, consistency, and freshness into a single index of how much of the system's output is genuinely verified versus merely plausible.

Think of VII as the aggregate of the other scores on this page. When KRS rises because the verification engine is catching more bad facts, VII moves. When KCI rises because contradictions are being resolved, VII moves. When the Seed climbs a VIMM stage, VII moves. It is the metric the whole architecture is optimised to push upward, slowly and durably.
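No formula for VII is given here, so the following is only a minimal sketch of what a composite index of this kind could look like. It assumes each of the four components named above is already normalised to [0, 1] and combines them with a weighted mean. The equal weights, the class name ViiInputs, and the function composite_vii are all hypothetical and chosen purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical weights: the article does not say how VII weighs its components.
ASSUMED_WEIGHTS = {
    "reasoning_quality": 0.25,
    "knowledge_reliability": 0.25,
    "consistency": 0.25,
    "freshness": 0.25,
}


@dataclass(frozen=True)
class ViiInputs:
    """Component scores, each assumed to be normalised to the range [0, 1]."""

    reasoning_quality: float
    knowledge_reliability: float
    consistency: float
    freshness: float


def composite_vii(inputs: ViiInputs, weights: dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted mean of the four components (an assumed aggregation)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for name, weight in weights.items():
        value = getattr(inputs, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        score += weight * value
    return score / total_weight
```

Under this assumed aggregation, raising any single component while holding the others fixed raises the composite, which matches the "VII moves" behaviour described above. A real index might instead use a geometric mean or a minimum, so that one weak component cannot be hidden by strong ones.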

KRS — Knowledge Reliability Score

Every fact in the graph carries its own KRS. It is derived from four inputs: source quality, multi-model consensus at the moment of extraction, survival under re-verification cycles, and corroboration by independent facts elsewhere in the graph.

Low-KRS facts are quarantined — they can be retrieved for inspection but they cannot be cited and they cannot be distilled into the reasoner. High-KRS facts graduate into the trusted substrate. KRS is the gate between "the model said this once" and "the system knows this."
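The gate described above can be sketched as follows. This is an illustration under stated assumptions, not the production scoring rule: the four inputs are assumed to be scaled to [0, 1], they are combined with a plain average, and the 0.7 cut-off between quarantined and trusted is invented for the example. The names FactEvidence, krs, and gate are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cut-off: the article does not state where "low" ends and "high" begins.
ASSUMED_TRUST_THRESHOLD = 0.7


@dataclass(frozen=True)
class FactEvidence:
    """The four KRS inputs named in the text, each assumed to lie in [0, 1]."""

    source_quality: float
    extraction_consensus: float
    reverification_survival: float
    corroboration: float


def krs(evidence: FactEvidence) -> float:
    """Assumed aggregation: the unweighted mean of the four inputs."""
    parts = (
        evidence.source_quality,
        evidence.extraction_consensus,
        evidence.reverification_survival,
        evidence.corroboration,
    )
    return sum(parts) / len(parts)


def gate(score: float, threshold: float = ASSUMED_TRUST_THRESHOLD) -> dict[str, bool]:
    """Permissions for a fact: always retrievable, but citable and distillable only when trusted."""
    trusted = score >= threshold
    return {"retrievable": True, "citable": trusted, "distillable": trusted}
```

The point of the sketch is the asymmetry in gate: a quarantined fact stays visible for inspection, but the two operations that would spread it (citation and distillation) are switched off together.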

KCI — Knowledge Consistency Index

KRS scores facts individually. KCI scores how well those facts agree with each other. A graph full of high-KRS facts that quietly contradict each other across sources, time periods, or related entities is still broken — KCI is the metric that catches it.

Rising KCI means contradictions are being detected and resolved faster than new ones are being introduced. Falling KCI is an early-warning signal that a particular domain is drifting and needs human review before it pollutes downstream answers.
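The text does not define how KCI is computed. One simple way to picture a consistency index, offered here only as a toy sketch, is the share of comparable fact pairs that agree. In this sketch two facts are treated as comparable when they describe the same entity and attribute, and as contradictory when their values differ. The Fact type, the kci function, and this pairwise definition are all assumptions made for the example.

```python
from __future__ import annotations

from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Fact(NamedTuple):
    """A toy fact: one value asserted for one attribute of one entity."""

    entity: str
    attribute: str
    value: str


def kci(facts: list[Fact]) -> float:
    """Share of comparable fact pairs that agree (an assumed definition)."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for fact in facts:
        groups[(fact.entity, fact.attribute)].append(fact.value)

    comparable = 0
    agreeing = 0
    for values in groups.values():
        for left, right in combinations(values, 2):
            comparable += 1
            if left == right:
                agreeing += 1
    # With nothing to compare, no contradiction has been observed.
    return 1.0 if comparable == 0 else agreeing / comparable
```

Computed per domain and tracked over cycles, a number like this falls when new contradictions arrive faster than old ones are resolved, which is the early-warning behaviour described above. A real graph would also need to handle contradictions across time periods and related entities, which exact-match comparison does not capture.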

VIMM — Verified Intelligence Maturity Model

VIMM is the staged roadmap each Seed climbs. The stages, in order, are:

Stage 1 — Retrieval: the Seed can find relevant material in its corpus.
Stage 2 — Verified recall: the Seed only surfaces facts that have cleared verification.
Stage 3 — Structured reasoning: the Seed composes verified facts into multi-hop answers with traceable provenance.
Stage 4 — Self-correction: the Seed detects its own contradictions and re-verifies them without prompting.
Stage 5 — Knowledge stewardship: the Seed proactively maintains its own graph — pruning stale facts, requesting fresh sources, distilling structure back into the reasoner.

VIMM turns the vague question "is this AI good?" into the actionable question "which maturity stage is this Seed operating at, and what is blocking the next one?"
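The two halves of that question can be sketched in a few lines. The stage names below come from the list above; everything else is an assumption for illustration, in particular the idea that stages are cumulative (a Seed counts as being at a stage only if every earlier stage has also passed) and the hypothetical helpers current_stage and next_blocker.

```python
from __future__ import annotations

from enum import IntEnum
from typing import Optional


class VimmStage(IntEnum):
    """The five VIMM stages, numbered as in the text."""

    RETRIEVAL = 1
    VERIFIED_RECALL = 2
    STRUCTURED_REASONING = 3
    SELF_CORRECTION = 4
    KNOWLEDGE_STEWARDSHIP = 5


def current_stage(passed: set[VimmStage]) -> Optional[VimmStage]:
    """Highest stage reached with every earlier stage also passed (assumed rule)."""
    reached = None
    for stage in VimmStage:  # iterates in definition order, stage 1 to stage 5
        if stage not in passed:
            break
        reached = stage
    return reached


def next_blocker(passed: set[VimmStage]) -> Optional[VimmStage]:
    """First stage that has not passed, or None when all five have."""
    for stage in VimmStage:
        if stage not in passed:
            return stage
    return None
```

For example, a Seed that passes the checks for stages 1, 2, and 4 would be reported at Verified recall under this rule, with Structured reasoning as the blocker, even though it shows some self-correction behaviour.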

CIB — Continuous Intelligence Benchmark

Static benchmarks go stale the moment models memorise them. CIB is a rolling, continuously refreshed evaluation harness that re-runs against the live Seed every cycle — using fresh probes, held-out facts, and adversarial challenges drawn from the verification pipeline itself.

The point of CIB is not a single score but a slope. VII, KRS, KCI, and VIMM all become time series under CIB. Improvement becomes measurable. Regressions become detectable within a cycle, not after a quarter.
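To make "a slope, not a score" concrete, here is a minimal sketch that fits an ordinary least-squares line through one metric's per-cycle values and flags a drop between the last two cycles. It assumes one reading per cycle at evenly spaced cycles; the function names slope_per_cycle and regressed and the 0.01 tolerance are hypothetical, not part of CIB itself.

```python
from __future__ import annotations


def slope_per_cycle(scores: list[float]) -> float:
    """Ordinary least-squares slope of score against cycle index 0, 1, 2, ..."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two cycles to estimate a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def regressed(scores: list[float], tolerance: float = 0.01) -> bool:
    """True when the latest cycle is more than `tolerance` below the previous one."""
    return len(scores) >= 2 and scores[-2] - scores[-1] > tolerance
```

The two functions answer different questions: slope_per_cycle summarises the long-run trend across all recorded cycles, while regressed looks only at the most recent step, which is what allows a drop to be caught within a single cycle.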

How they fit together

The five metrics interlock by design:

KRS scores individual facts.
KCI checks whether those facts agree with each other.
VII rolls reliability and consistency, together with reasoning quality and freshness, into a single index of verified output quality.
VIMM describes where the Seed sits on the long curve from retrieval to stewardship.
CIB tracks all of it over time so improvement is measurable, not anecdotal.

Move any one of them and the others register the change. That is the difference between a benchmark and an instrument: an instrument tells you which knob to turn next.

What this enables

With this stack in place, a Seed is no longer a black box you either trust or don't. It is a system with a published reliability number per fact, a published consistency number per domain, a published maturity stage, a published improvement slope, and a single composite index you can point at to say: "this is more verified today than it was last week, and here is exactly why."

That is what verified intelligence infrastructure looks like when you actually measure it.