Registry Diagnostics is the internal diagnostic truth page for inspecting how existing Eval Labs run and review data is being classified into Dataset Registry suggestions and Human Review Queue 2.0 lanes.
Status
Plain-English definition
Registry Diagnostics is an internal diagnostic truth page that reads existing Eval Labs run and review data and shows how the derived Dataset Registry and Human Review Queue 2.0 model classifies that data. It answers:
- What datasets does this prompt/run appear to belong to?
- Why did the model suggest that dataset?
- What confidence did it have?
- What source fields caused the match?
- Which review queue lane is suggested?
- Is the model overmatching, under-matching, or relying on weak evidence?
What it is
Registry Diagnostics is:
- a diagnostic page
- a read-only inspection surface
- a way to inspect derived dataset membership suggestions
- a way to inspect derived Human Review Queue 2.0 lane suggestions
- a way to find weak evidence, broad matches, and overmatching patterns
- a development and quality-control aid for the classification model
What it is not
Registry Diagnostics is not:
- a label creation workflow
- a Behavioral Observatory review workflow
- a place to save human behavioral review decisions
- proof that a dataset membership is final
- proof that a queue lane assignment is final
- final employee workflow UX
- a persistence guarantee
Why it exists
Eval Labs now has enough run, review, analysis, and human judgment data that the product needs a way to inspect how that evidence is being classified. Without Registry Diagnostics, future developers could see a dataset card or queue lane and assume:
What data it reads
Registry Diagnostics reads existing scoped Eval Labs data, including run/session records, prompt/run items, saved or draft review records, suggested review fields, human review fields, adjudication hints, and other current store evidence. It does not create new dataset records. It does not create new queue decisions. It does not write Behavioral Observatory labels.
Suggested membership
A dataset membership suggestion means:
Confidence
Confidence is the model’s strength-of-evidence score for the suggestion. High confidence means the source fields gave stronger matching evidence. Low confidence means the suggestion is weaker and should be inspected carefully. Confidence is not human approval. It is a diagnostic score.
Source fields
Source fields explain why the suggestion happened. Examples of source evidence include:
- prompt text
- Lucia response text
- run title
- category or subcategory
- saved review fields
- suggested review fields
- human label fields
- adjudication fields
- notes or review state
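One way to picture a suggestion is as a small immutable record that carries its confidence and the source fields behind it. The sketch below is illustrative only: the class name `DatasetSuggestion`, its field names, the sample identifiers, and the 0-to-1 confidence scale are all assumptions, not the actual Registry Diagnostics schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetSuggestion:
    """Hypothetical shape of one derived dataset membership suggestion."""

    item_id: str
    dataset: str
    confidence: float  # assumed scale: 0.0 (weak) to 1.0 (strong)
    source_fields: Tuple[str, ...]  # evidence that caused the match
    human_approved: bool = False  # confidence alone never implies approval


def explain(suggestion: DatasetSuggestion) -> str:
    """Render a one-line, human-readable reason for a suggestion."""
    fields = ", ".join(suggestion.source_fields) or "no recorded evidence"
    return (
        f"{suggestion.item_id} -> {suggestion.dataset} "
        f"(confidence {suggestion.confidence:.2f}; matched on: {fields})"
    )


# Sample values are made up for illustration.
example = DatasetSuggestion(
    item_id="item-001",
    dataset="example-dataset",
    confidence=0.82,
    source_fields=("prompt text", "category or subcategory"),
)
print(explain(example))
```

Keeping the record frozen mirrors the read-only nature of the page: inspecting a suggestion should never change it.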
Noise / Watch
Noise / Watch is the section for classification patterns that deserve human inspection. Use it to look for:
- broad matches that may be too generic
- items suggested for too many datasets
- low-confidence suggestions
- queue lanes triggered by weak evidence
- possible overmatching
- possible under-matching
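Several of the checks in the list above can be sketched as simple filters over suggestion records. Everything here is hypothetical: the record layout, the flag names, the 0.5 confidence cutoff, and the limit of three datasets per item are placeholder assumptions chosen for illustration, not values used by Registry Diagnostics.

```python
from collections import defaultdict

# Hypothetical thresholds; the real page may apply different rules.
LOW_CONFIDENCE = 0.5
MAX_DATASETS_PER_ITEM = 3


def watch_flags(suggestions):
    """Return {item_id: sorted flag names} for items worth inspecting.

    Each suggestion is a dict with the assumed keys:
    item_id, dataset, confidence, source_fields.
    """
    datasets_by_item = defaultdict(set)
    flags = defaultdict(set)

    for s in suggestions:
        datasets_by_item[s["item_id"]].add(s["dataset"])
        if s["confidence"] < LOW_CONFIDENCE:
            flags[s["item_id"]].add("low_confidence")
        # Treat a match backed by at most one source field as weak evidence.
        if len(s["source_fields"]) <= 1:
            flags[s["item_id"]].add("weak_evidence")

    for item_id, datasets in datasets_by_item.items():
        if len(datasets) > MAX_DATASETS_PER_ITEM:
            flags[item_id].add("too_many_datasets")

    return {item_id: sorted(names) for item_id, names in flags.items()}


# Sample values are made up for illustration.
sample = [
    {"item_id": "a", "dataset": "d1", "confidence": 0.9,
     "source_fields": ["prompt text", "run title"]},
    {"item_id": "b", "dataset": "d1", "confidence": 0.3,
     "source_fields": ["run title"]},
]
print(watch_flags(sample))
```

Under-matching is not covered by this sketch, because detecting it requires knowing which items should have matched but did not, and that cannot be derived from the suggestions alone.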
Who uses it
Registry Diagnostics is mainly for:
- founders
- owners/admins
- product leads
- engineers working on classification
- senior reviewers inspecting review-routing behavior
Step-by-step usage
- Open /registry-diagnostics.
- Read the top summary first.
- Check how many canonical datasets exist.
- Check how many runs and items are included in the scoped snapshot.
- Inspect dataset cards.
- Read the confidence level before trusting the suggestion.
- Read the source fields to see why the match happened.
- Inspect Human Review Queue 2.0 lane suggestions.
- Use Noise / Watch to find weak, broad, or suspicious matches.
- Write down classification issues for product or engineering follow-up.

