Registry Diagnostics is the internal diagnostic truth page for inspecting how existing Eval Labs run and review data is being classified into Dataset Registry suggestions and Human Review Queue 2.0 lanes.

Status

Dataset Registry: implemented as seeded diagnostic taxonomy
Registry Diagnostics: implemented diagnostic surface
Suggested membership: derived
Review Queue lane suggestion: derived
Human behavioral label save: not part of this surface
Final employee workflow UX: deferred
Canonical route:
/registry-diagnostics
Legacy inbound alias:
/dataset-diagnostics
Current access intent is owner/admin. Do not describe this as a general evaluator workflow unless product access changes.
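The route and access rules above can be sketched as a small lookup. This is a minimal illustration, not the actual Eval Labs router: the function names and the role strings "owner" and "admin" are assumptions, and only the two paths come from this page.

```python
# Paths taken from this page; everything else is a hypothetical sketch.
CANONICAL_ROUTE = "/registry-diagnostics"
LEGACY_ALIASES = {"/dataset-diagnostics": CANONICAL_ROUTE}

# Assumed role names for the owner/admin access intent.
ALLOWED_ROLES = frozenset({"owner", "admin"})


def resolve_route(path: str) -> str:
    """Map a legacy inbound alias to the canonical route; pass other paths through."""
    normalized = path.rstrip("/") or "/"
    return LEGACY_ALIASES.get(normalized, normalized)


def can_open_registry_diagnostics(role: str) -> bool:
    """Return True only for roles covered by the current owner/admin intent."""
    return role.lower() in ALLOWED_ROLES
```

Under these assumptions, resolve_route("/dataset-diagnostics") returns "/registry-diagnostics", and an "evaluator" role is refused until product access changes.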

Plain-English definition

Registry Diagnostics is an internal diagnostic truth page that reads existing Eval Labs run and review data and shows how the derived Dataset Registry and Human Review Queue 2.0 model classifies that data. It answers:
  • What datasets does this prompt/run appear to belong to?
  • Why did the model suggest that dataset?
  • What confidence did it have?
  • What source fields caused the match?
  • Which review queue lane is suggested?
  • Is the model overmatching, undermatching, or relying on weak evidence?

What it is

Registry Diagnostics is:
  • a diagnostic page
  • a read-only inspection surface
  • a way to inspect derived dataset membership suggestions
  • a way to inspect derived Human Review Queue 2.0 lane suggestions
  • a way to find weak evidence, broad matches, and overmatching patterns
  • a development and quality-control aid for the classification model
The important word is:
derived
Derived means the page is reading existing run and review fields and producing a suggested interpretation. It is not saving a new human decision.

What it is not

Registry Diagnostics is not:
  • a label creation workflow
  • a Behavioral Observatory review workflow
  • a place to save human behavioral review decisions
  • proof that a dataset membership is final
  • proof that a queue lane assignment is final
  • final employee workflow UX
  • a persistence guarantee
Do not use Registry Diagnostics language when you mean Behavioral Observatory saved labels.

Why it exists

Eval Labs now has enough run, review, analysis, and human judgment data that the product needs a way to inspect how that evidence is being classified. Without Registry Diagnostics, future developers could see a dataset card or queue lane and assume:
The system knows this belongs here.
That is too strong. The correct interpretation is:
The model found evidence that suggests this might belong here.
Registry Diagnostics exists to make that evidence visible before the model is silently trusted.

What data it reads

Registry Diagnostics reads existing scoped Eval Labs data, including:
  • run/session records
  • prompt/run items
  • saved or draft review records
  • suggested review fields
  • human review fields
  • adjudication hints
  • other current store evidence
It does not create new dataset records, create new queue decisions, or write Behavioral Observatory labels.
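One way to picture "reads but never writes" is a pure function over a read-only view of a record. The sketch below is an assumption-heavy illustration: the dataset ids, keyword rules, and field names are hypothetical, and simple substring matching stands in for whatever the real classification model does.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping

# Hypothetical keyword rules; the real model's rules are not described on this page.
RULES = {
    "refusals": ("refuse", "decline"),
    "tone": ("tone", "rude"),
}


def derive_suggestions(record: Mapping[str, str]) -> Dict[str, List[str]]:
    """Return {dataset_id: [fields that matched]} without writing anything back."""
    suggestions: Dict[str, List[str]] = {}
    for dataset_id, keywords in RULES.items():
        hits = [
            field
            for field, text in record.items()
            if any(keyword in text.lower() for keyword in keywords)
        ]
        if hits:
            suggestions[dataset_id] = hits
    return suggestions


# MappingProxyType gives a read-only view, so accidental writes raise TypeError.
record = MappingProxyType(
    {"prompt_text": "Please refuse politely", "run_title": "Tone check"}
)
derived = derive_suggestions(record)
```

The result is computed fresh from the existing fields each time. Nothing in this sketch creates a dataset record, a queue decision, or a Behavioral Observatory label.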

Suggested membership

A dataset membership suggestion means:
The classification model found evidence that this run item appears related to a canonical dataset.
It does not mean:
A human approved this dataset membership.
Use suggested membership to inspect evidence, not to claim final truth.
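A suggestion can be modeled as an immutable value that never claims approval. This is a hypothetical shape for illustration only; the class and field names are assumptions, not the stored Eval Labs schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MembershipSuggestion:
    """Hypothetical shape of one derived dataset membership suggestion."""

    run_item_id: str
    dataset_id: str
    confidence: float
    source_fields: Tuple[str, ...]
    status: str = "suggested"  # derived, never "approved"

    @property
    def is_final(self) -> bool:
        # A derived suggestion is never final truth in this sketch.
        return False


def describe(suggestion: MembershipSuggestion) -> str:
    """Phrase the suggestion as evidence, not as a decision."""
    return (
        f"Run item {suggestion.run_item_id} appears related to "
        f"{suggestion.dataset_id} (suggested, not human-approved)"
    )
```

Freezing the dataclass means code that tries to flip status to an approved state fails loudly instead of silently upgrading a suggestion.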

Confidence

Confidence is the model’s strength-of-evidence score for the suggestion. High confidence means the source fields gave stronger matching evidence. Low confidence means the suggestion is weaker and should be inspected carefully. Confidence is not human approval. It is a diagnostic score.
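As a rough sketch of how a score could be read, the function below buckets confidence into bands. The 0 to 1 scale and the 0.75 and 0.40 cutoffs are assumptions chosen for illustration; this page does not state the real scale or thresholds.

```python
def confidence_band(score: float, high: float = 0.75, low: float = 0.40) -> str:
    """Bucket a diagnostic confidence score into "high", "medium", or "low".

    Assumes scores fall in [0, 1]; the default cutoffs are illustrative only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {score}")
    if score >= high:
        return "high"
    if score < low:
        return "low"
    return "medium"
```

Even a "high" band here only describes strength of evidence. It is still a diagnostic score, not human approval.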

Source fields

Source fields explain why the suggestion happened. Examples of source evidence include:
  • prompt text
  • Lucia response text
  • run title
  • category or subcategory
  • saved review fields
  • suggested review fields
  • human label fields
  • adjudication fields
  • notes or review state
Source fields matter because they show whether the model matched on meaningful evidence or on a thin keyword.
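The difference between meaningful evidence and a thin keyword can be approximated with a simple heuristic. The sketch below is hypothetical: the field names, the FieldMatch type, and the "at least two fields or a multi-word phrase" rule are assumptions, not the model's actual logic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FieldMatch:
    """One piece of source evidence (hypothetical shape)."""

    field: str         # e.g. "prompt_text" or "category"
    matched_text: str  # the text that triggered the match


def is_thin_evidence(
    matches: Sequence[FieldMatch], min_fields: int = 2, min_words: int = 2
) -> bool:
    """Flag a suggestion that rests on one field and a single-word match."""
    distinct_fields = {match.field for match in matches}
    has_phrase = any(
        len(match.matched_text.split()) >= min_words for match in matches
    )
    return len(distinct_fields) < min_fields and not has_phrase
```

With these defaults, a lone match on the word "refund" in the prompt text counts as thin, while the same word matched in both the prompt text and the category does not.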

Noise / Watch

Noise / Watch is the section for classification patterns that deserve human inspection. Use it to look for:
  • broad matches that may be too generic
  • items suggested for too many datasets
  • low-confidence suggestions
  • queue lanes triggered by weak evidence
  • possible overmatching
  • possible undermatching
Noise / Watch is not a release decision. It is a diagnostic warning area.
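Two of the patterns above, low-confidence suggestions and items suggested for too many datasets, can be sketched as simple checks over the suggestion list. The tuple layout, flag names, and default limits are assumptions for illustration. Undermatching is not covered, because it cannot be detected from the suggestions alone.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (run_item_id, dataset_id, confidence): a hypothetical minimal layout.
Suggestion = Tuple[str, str, float]


def noise_watch_flags(
    suggestions: Iterable[Suggestion],
    max_datasets: int = 3,
    low_confidence: float = 0.40,
) -> Dict[str, List[str]]:
    """Return {run_item_id: sorted warning flags} for items worth inspecting."""
    datasets_by_item = defaultdict(set)
    flags = defaultdict(set)
    for item_id, dataset_id, confidence in suggestions:
        datasets_by_item[item_id].add(dataset_id)
        if confidence < low_confidence:
            flags[item_id].add("low_confidence")
    for item_id, datasets in datasets_by_item.items():
        if len(datasets) > max_datasets:
            flags[item_id].add("too_many_datasets")
    return {item_id: sorted(item_flags) for item_id, item_flags in flags.items()}
```

The output is a list of warnings for a human to inspect. It does not change any suggestion and is not a release decision.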

Who uses it

Registry Diagnostics is mainly for:
  • founders
  • owners/admins
  • product leads
  • engineers working on classification
  • senior reviewers inspecting review-routing behavior
Entry-level evaluators should not be expected to interpret Registry Diagnostics unless an owner/admin gives them a specific training task.

Step-by-step usage

  1. Open /registry-diagnostics.
  2. Read the top summary first.
  3. Check how many canonical datasets exist.
  4. Check how many runs and items are included in the scoped snapshot.
  5. Inspect dataset cards.
  6. Read the confidence level before trusting the suggestion.
  7. Read the source fields to see why the match happened.
  8. Inspect Human Review Queue 2.0 lane suggestions.
  9. Use Noise / Watch to find weak, broad, or suspicious matches.
  10. Write down classification issues for product or engineering follow-up.
Do not try to save Behavioral Observatory labels from this page. It is read-only and does not write them.

Canon rule

Registry Diagnostics = derived classification inspection.
Behavioral Observatory = saved reviewer behavioral labels.
Keep that distinction painfully visible.