This is the simple operator guide for the major Eval Labs surfaces. Use it to keep diagnostic pages, review pages, and analysis pages distinct from one another, and to know which surface holds persisted data.

Status labels

Use these words exactly:
implemented = present product/code path
active hardening = implemented path still being validated, polished, or tightened
diagnostic = read-only inspection
derived = suggested from existing data
persisted = saved and reloadable from Supabase
future = planned or possible later
deferred = intentionally outside current role/surface scope
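
The "use these words exactly" rule can be sketched as a closed vocabulary. This is only an illustration, not Eval Labs code: the names StatusLabel and parse_status are hypothetical, and the values are copied from the list above.

```python
from enum import Enum


class StatusLabel(Enum):
    """The seven status words from this guide (hypothetical type name)."""
    IMPLEMENTED = "implemented"
    ACTIVE_HARDENING = "active hardening"
    DIAGNOSTIC = "diagnostic"
    DERIVED = "derived"
    PERSISTED = "persisted"
    FUTURE = "future"
    DEFERRED = "deferred"


def parse_status(text: str) -> StatusLabel:
    """Return the matching label; raise ValueError for any other wording.

    No case folding or synonym handling is applied, because the guide
    asks for the exact words.
    """
    return StatusLabel(text)
```

With this sketch, parse_status("persisted") returns StatusLabel.PERSISTED, while a near-synonym such as "Saved" raises ValueError instead of being silently accepted.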

A. Home

Use Home when you need the owner/admin overview. Home shows:
  • platform status
  • recent activity
  • quick access to major surfaces
  • product state
  • readiness or evidence summaries when available
Do not treat Home as a human approval page. Home can show platform progress. Approval of Lucia's quality still comes from human review.

B. Registry Diagnostics

Use Registry Diagnostics when you need to inspect derived classification behavior. Route:
/registry-diagnostics
Why it exists:
To show why existing Eval Labs data appears to match datasets and review queue lanes.
How to read the top summary:
  1. Check how many datasets exist.
  2. Check how many runs are included.
  3. Check how many suggested memberships exist.
  4. Check confidence breakdowns.
  5. Treat the whole page as diagnostic.
How to inspect dataset cards:
  1. Read the dataset name.
  2. Read the suggested membership count.
  3. Check confidence.
  4. Check source fields.
  5. Ask whether the evidence is real or thin.
How to inspect queue lanes:
  1. Read the suggested lane.
  2. Check which items triggered it.
  3. Check confidence.
  4. Look for broad or weak matches.
How to use Noise / Watch:
  1. Look for overmatching.
  2. Look for weak low-confidence matches.
  3. Look for items suggested for too many datasets.
  4. Write down issues for product or engineering.
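
The Noise / Watch checks above can be sketched as a small filter over suggested memberships. This is a rough illustration under stated assumptions: the SuggestedMembership shape, the "low" / "medium" / "high" confidence values, and the max_datasets threshold are all hypothetical and are not taken from the product.

```python
from dataclasses import dataclass


@dataclass
class SuggestedMembership:
    item_id: str     # hypothetical identifier for an item in a run
    dataset: str     # dataset the item was suggested for
    confidence: str  # assumed to be "low", "medium", or "high"


def noise_watch(memberships, max_datasets=3):
    """Return a list of issues to write down for product or engineering.

    Flags low-confidence matches and items suggested for more than
    max_datasets datasets (an assumed threshold for "too many").
    """
    issues = []
    datasets_by_item = {}
    for m in memberships:
        datasets_by_item.setdefault(m.item_id, set()).add(m.dataset)
        if m.confidence == "low":
            issues.append(f"{m.item_id}: low-confidence match for {m.dataset}")
    for item_id in sorted(datasets_by_item):
        count = len(datasets_by_item[item_id])
        if count > max_datasets:
            issues.append(f"{item_id}: suggested for {count} datasets")
    return issues
```

The output is only a list of notes. Nothing here changes membership or routing, which matches the read-only, diagnostic nature of the page.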
What not to assume:
  • Do not assume dataset membership is final.
  • Do not assume queue routing is final.
  • Do not assume a human saved a label.
  • Do not use this page as a substitute for Behavioral Observatory. Labels are reviewed and saved there, not here.

C. Behavioral Observatory

Use Behavioral Observatory when you need to review conversations and save behavioral labels. Route:
/behavioral-observatory
Before using it:
  1. Confirm your role and assignment allow access.
  2. Confirm the data shown is from the scoped run context you intend to review.
  3. Notice whether the selected conversation is showing a derived suggestion or a saved label.
Select a conversation:
  1. Open the labeling queue.
  2. Pick one conversation.
  3. Read the Human message first.
  4. Read Lucia’s response second.
Set Intent:
  1. Choose what the human was trying to do.
  2. Use Other only when the listed categories do not fit.
Set Guest Affect:
  1. Choose the smallest truthful emotional read.
  2. Do not dramatize the guest’s state.
Set Response Strategy:
  1. Choose Lucia’s main response move.
  2. Pick the dominant strategy, not every strategy present.
Set Humanness:
1 = Template
4 = Functional
7 = Warm + Specific
Do not use humanness as a substitute for truth, usefulness, or safety.
Add notes:
  1. Add a note only when it preserves useful behavioral evidence.
  2. Keep it short.
  3. Name the behavior and why it matters.
Save label:
  1. Click Save label or Save updates.
  2. Wait for the saved state.
  3. If the state is error, do not count the label as persisted.
Refresh/check saved status if needed:
  1. Refresh the page.
  2. Confirm the label reloads.
  3. If it does not reload, treat the label as not verified.
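
The save and refresh steps above amount to a small decision rule, sketched here. The SaveState enum, its SAVING value, and the label_status function are hypothetical names used for illustration; the guide itself only names a saved state and an error state.

```python
from enum import Enum
from typing import Optional


class SaveState(Enum):
    SAVING = "saving"  # assumed intermediate state while waiting
    SAVED = "saved"
    ERROR = "error"


def label_status(save_state: SaveState,
                 reloaded_after_refresh: Optional[bool] = None) -> str:
    """Classify a label following the save and refresh steps.

    reloaded_after_refresh is True or False once the page has been
    refreshed, and None if no refresh check has been done yet.
    """
    if save_state is SaveState.ERROR:
        return "not persisted"
    if save_state is SaveState.SAVING:
        return "pending"
    if reloaded_after_refresh is None:
        return "saved, reload not checked"
    return "persisted" if reloaded_after_refresh else "not verified"
```

For example, label_status(SaveState.SAVED, False) returns "not verified": a saved state on screen is not enough if the label does not come back after a refresh.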

D. Guest Facing Agent Verification

Use Guest Facing Agent Verification when you need to run or inspect booked-guest verification behavior. Routes:
/guest-facing/verification
/guest-facing/verification/results
Verification is for:
  • running the scenario pack from the app surface
  • inspecting pass/fail results
  • reviewing failure details
  • exporting or copying verification summaries
Verification is not:
  • a tester lane
  • Team Review
  • Global Analysis
  • proof that Lucia is human-approved

E. Team Review

Use Team Review when owner/admin needs oversight of evaluator activity and review quality. Route:
/team-review
Team Review is for:
  • inspecting evaluator activity
  • finding missing checks
  • spotting flags and failures
  • reviewing recent human-evaluation signal
  • deciding what needs owner/admin attention
Team Review is not:
  • evaluator productivity tracking for its own sake
  • tester onboarding
  • a human approval page

F. Global Analysis

Use Global Analysis when you need read-only behavioral and analytics evidence. Route:
/analysis
Global Analysis is for:
  • inspecting completed run evidence
  • reading behavioral summaries
  • comparing patterns
  • opening Single Run Analysis when available
Global Analysis is not:
  • a human approval page
  • a Behavioral Observatory label-save workflow
  • Registry Diagnostics

G. Run History

Use Run History when you need the run ledger. Route:
/lucia/automated/runs
Run History is for:
  • finding completed/finalized runs
  • checking run lifecycle truth
  • copying run/session identifiers
  • opening review or analysis routes when allowed
Run History is not:
  • proof that a response was good
  • proof that Behavioral Observatory labels exist
  • proof that Lucia is human-approved

Operator rule

Use the right surface for the right truth:
Run History = run ledger truth
Team Review = owner/admin oversight truth
Global Analysis = read-only evidence truth
Registry Diagnostics = derived classification truth
Behavioral Observatory = saved behavioral label truth
Review Queue = prompt/item review workflow
Human reviewers = Lucia quality judgment
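
The operator rule can also be kept as a lookup table, sketched below. The dictionary values are copied from the list above; the names SURFACE_TRUTH and surface_for are hypothetical and do not refer to anything in the product.

```python
# Surface-to-truth pairs, copied from the operator rule above.
SURFACE_TRUTH = {
    "Run History": "run ledger truth",
    "Team Review": "owner/admin oversight truth",
    "Global Analysis": "read-only evidence truth",
    "Registry Diagnostics": "derived classification truth",
    "Behavioral Observatory": "saved behavioral label truth",
    "Review Queue": "prompt/item review workflow",
    "Human reviewers": "Lucia quality judgment",
}


def surface_for(truth: str) -> str:
    """Return the surface that owns the given kind of truth.

    Raises KeyError when no surface in the table owns it.
    """
    for surface, owned in SURFACE_TRUTH.items():
        if owned == truth:
            return surface
    raise KeyError(truth)
```

For example, surface_for("saved behavioral label truth") returns "Behavioral Observatory", so a question about whether a label was saved should not be answered from Registry Diagnostics or Run History.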