Dataset Registry and Registry Diagnostics

Registry Diagnostics is the internal diagnostic truth page for inspecting how existing Eval Labs run and review data is being classified into Dataset Registry suggestions and Human Review Queue 2.0 lanes.

Status

Dataset Registry: implemented as seeded diagnostic taxonomy
Registry Diagnostics: implemented diagnostic surface
Suggested membership: derived
Review Queue lane suggestion: derived
Human behavioral label save: not part of this surface
Final employee workflow UX: deferred

Canonical route:

/registry-diagnostics

Legacy inbound alias:

/dataset-diagnostics
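
As a minimal sketch of how the legacy alias might resolve to the canonical route: only the two paths come from this page. The `resolve_route` helper and the `LEGACY_ALIASES` mapping are hypothetical names used for illustration, not taken from the Eval Labs codebase.

```python
# Hypothetical sketch: only the two route paths come from this page.
CANONICAL_ROUTE = "/registry-diagnostics"
LEGACY_ALIASES = {"/dataset-diagnostics": CANONICAL_ROUTE}


def resolve_route(path: str) -> str:
    """Map a legacy inbound alias to the canonical route; pass other paths through."""
    # Drop a trailing slash so "/dataset-diagnostics/" also resolves.
    normalized = path.rstrip("/") or "/"
    return LEGACY_ALIASES.get(normalized, normalized)
```

Keeping the alias in one mapping means old inbound links keep working while new links and docs use only the canonical route.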

Access is currently intended for owners and admins only. Do not describe this as a general evaluator workflow unless product access changes.

Plain-English definition

Registry Diagnostics is an internal diagnostic truth page that reads existing Eval Labs run and review data and shows how the derived Dataset Registry and Human Review Queue 2.0 model classifies that data. It answers:

What datasets does this prompt/run appear to belong to?
Why did the model suggest that dataset?
What confidence did it have?
What source fields caused the match?
Which review queue lane is suggested?
Is the model overmatching, under-matching, or relying on weak evidence?

What it is

Registry Diagnostics is:

a diagnostic page
a read-only inspection surface
a way to inspect derived dataset membership suggestions
a way to inspect derived Human Review Queue 2.0 lane suggestions
a way to find weak evidence, broad matches, and overmatching patterns
a development and quality-control aid for the classification model

The important word is:

derived

Derived means the page is reading existing run and review fields and producing a suggested interpretation. It is not saving a new human decision.

What it is not

Registry Diagnostics is not:

a label creation workflow
a Behavioral Observatory review workflow
a place to save human behavioral review decisions
proof that a dataset membership is final
proof that a queue lane assignment is final
final employee workflow UX
a persistence guarantee

Do not use Registry Diagnostics language when you mean Behavioral Observatory saved labels.

Why it exists

Eval Labs now has enough run, review, analysis, and human judgment data that the product needs a way to inspect how that evidence is being classified. Without Registry Diagnostics, future developers could see a dataset card or queue lane and assume:

The system knows this belongs here.

That is too strong. The correct interpretation is:

The model found evidence that suggests this might belong here.

Registry Diagnostics exists to make that evidence visible before the model is silently trusted.

What data it reads

Registry Diagnostics reads existing scoped Eval Labs data, including:

run/session records
prompt/run items
saved or draft review records
suggested review fields
human review fields
adjudication hints
other evidence currently in the store

It does not create new dataset records. It does not create new queue decisions. It does not write Behavioral Observatory labels.

Suggested membership

A dataset membership suggestion means:

The classification model found evidence that this run item appears related to a canonical dataset.

It does not mean:

A human approved this dataset membership.

Use suggested membership to inspect evidence, not to claim final truth.
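
One way to keep the "suggested, not approved" distinction explicit is to model a suggestion as an immutable record computed from existing fields. The sketch below is illustrative only: `MembershipSuggestion`, `suggest_membership`, the item layout, and the keyword-based scoring are all assumptions, not the actual classification model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MembershipSuggestion:
    """A derived suggestion. Nothing here is persisted or human-approved."""
    item_id: str
    dataset: str
    confidence: float               # strength-of-evidence score, 0.0 to 1.0
    source_fields: Tuple[str, ...]  # existing fields that produced the match
    status: str = "derived"         # this surface never records an approval


def suggest_membership(item: dict, dataset: str, keywords: set) -> Optional[MembershipSuggestion]:
    """Derive a suggestion by reading existing fields; the item is not modified.

    Assumes `keywords` are already lowercase.
    """
    matched = tuple(
        name for name, value in item["fields"].items()
        if any(k in str(value).lower() for k in keywords)
    )
    if not matched:
        return None
    # Placeholder scoring (an assumption): share of the item's fields that matched.
    confidence = len(matched) / len(item["fields"])
    return MembershipSuggestion(item["id"], dataset, confidence, matched)
```

Because the record is frozen and the function only reads its input, there is no path by which inspecting a suggestion could turn it into a saved decision.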

Confidence

Confidence is the model’s strength-of-evidence score for the suggestion. High confidence means the source fields gave stronger matching evidence. Low confidence means the suggestion is weaker and should be inspected carefully. Confidence is not human approval. It is a diagnostic score.

Source fields

Source fields explain why the suggestion happened. Examples of source evidence include:

prompt text
Lucia response text
run title
category or subcategory
saved review fields
suggested review fields
human label fields
adjudication fields
notes or review state

Source fields matter because they show whether the model matched on meaningful evidence or on a thin keyword.
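
A rough sketch of how a "thin" match could be told apart from a better-supported one. The `SourceEvidence` record, the `is_thin_match` helper, and both thresholds are assumptions chosen for illustration; the real model may weigh fields differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SourceEvidence:
    field: str                      # e.g. "prompt_text", "run_title"
    matched_terms: Tuple[str, ...]  # terms found in that field


def is_thin_match(evidence: List[SourceEvidence], min_fields: int = 2, min_terms: int = 2) -> bool:
    """Flag a suggestion that rests on too few fields or too few distinct terms."""
    fields = {e.field for e in evidence}
    terms = {t for e in evidence for t in e.matched_terms}
    return len(fields) < min_fields or len(terms) < min_terms
```

For example, a suggestion backed only by the word "billing" in a run title would be flagged as thin, while one backed by several terms across the prompt text and a human label field would not.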

Noise / Watch

Noise / Watch is the section for classification patterns that deserve human inspection. Use it to look for:

broad matches that may be too generic
items suggested for too many datasets
low-confidence suggestions
queue lanes triggered by weak evidence
possible overmatching
possible under-matching

Noise / Watch is not a release decision. It is a diagnostic warning area.
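
The checks listed above could be sketched roughly as follows. The function name, the dictionary keys, and both thresholds are hypothetical. Under-matching is not covered, since it cannot be detected from the suggestions alone.

```python
from collections import defaultdict


def noise_watch_flags(suggestions, low_confidence=0.4, max_datasets=3):
    """Return {item_id: [reasons]} for suggestions that deserve human inspection.

    Each suggestion is a dict with "item_id", "dataset", and "confidence" keys.
    Thresholds are illustrative assumptions, not product values.
    """
    datasets_per_item = defaultdict(set)
    for s in suggestions:
        datasets_per_item[s["item_id"]].add(s["dataset"])

    flags = defaultdict(list)
    # Low-confidence suggestions.
    for s in suggestions:
        if s["confidence"] < low_confidence:
            flags[s["item_id"]].append(f"low confidence for {s['dataset']}")
    # Items suggested for too many datasets (possible overmatching).
    for item_id, datasets in datasets_per_item.items():
        if len(datasets) > max_datasets:
            flags[item_id].append(f"suggested for {len(datasets)} datasets")
    return dict(flags)
```

The output is a list of reasons to look closer, in keeping with the section's role as a warning area rather than a decision.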

Who uses it

Registry Diagnostics is mainly for:

founders
owners/admins
product leads
engineers working on classification
senior reviewers inspecting review-routing behavior

Entry-level evaluators should not be expected to interpret Registry Diagnostics unless an owner/admin gives them a specific training task.

Step-by-step usage

1. Open /registry-diagnostics.
2. Read the top summary first.
3. Check how many canonical datasets exist.
4. Check how many runs and items are included in the scoped snapshot.
5. Inspect dataset cards.
6. Read the confidence level before trusting the suggestion.
7. Read the source fields to see why the match happened.
8. Inspect Human Review Queue 2.0 lane suggestions.
9. Use Noise / Watch to find weak, broad, or suspicious matches.
10. Write down classification issues for product or engineering follow-up.

Do not try to save Behavioral Observatory labels from this page. Saving human behavioral labels is not part of this surface.

Canon rule

Registry Diagnostics = derived classification inspection.
Behavioral Observatory = saved reviewer behavioral labels.

Keep that distinction painfully visible.
