Eval Labs Step-by-Step Operator Guide

This is the simple operator guide for the major Eval Labs surfaces. Use it to keep diagnostic pages, review pages, analysis pages, and persistence truth distinct from one another.

Status labels

Use these words exactly:

implemented = present product/code path
active hardening = implemented path still being validated, polished, or tightened
diagnostic = read-only inspection
derived = suggested from existing data
persisted = saved and reloadable from Supabase
future = planned or possible later
deferred = intentionally outside current role/surface scope
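
If scripts or reports need to carry these labels, one way to keep the wording exact is a small enum. This is a minimal sketch, not part of Eval Labs: the names `StatusLabel` and `parse_status` are hypothetical, and only the label words themselves come from the list above.

```python
from enum import Enum


class StatusLabel(Enum):
    """Hypothetical enum; each value is the exact label word from this guide."""

    IMPLEMENTED = "implemented"
    ACTIVE_HARDENING = "active hardening"
    DIAGNOSTIC = "diagnostic"
    DERIVED = "derived"
    PERSISTED = "persisted"
    FUTURE = "future"
    DEFERRED = "deferred"


def parse_status(text: str) -> StatusLabel:
    """Accept only the exact label words; anything else raises ValueError."""
    return StatusLabel(text)
```

Rejecting near-synonyms such as "saved" keeps "persisted" reserved for its stated meaning.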

A. Home

Use Home when you need the owner/admin overview. Home shows:

platform status
recent activity
quick access to major surfaces
product state
readiness or evidence summaries when available

Do not treat Home as a human approval page. Home can show platform progress. Human Lucia-quality approval still comes from human review.

B. Registry Diagnostics

Use Registry Diagnostics when you need to inspect derived classification behavior. Route:

/registry-diagnostics

Why it exists:

To show why existing Eval Labs data appears to match datasets and review queue lanes.

How to read the top summary:

Check how many datasets exist.
Check how many runs are included.
Check how many suggested memberships exist.
Check confidence breakdowns.
Treat the whole page as diagnostic.

How to inspect dataset cards:

Read the dataset name.
Read the suggested membership count.
Check confidence.
Check source fields.
Ask whether the evidence is real or thin.

How to inspect queue lanes:

Read the suggested lane.
Check which items triggered it.
Check confidence.
Look for broad or weak matches.

How to use Noise / Watch:

Look for overmatching.
Look for weak low-confidence matches.
Look for items suggested for too many datasets.
Write down issues for product or engineering.
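
The Noise / Watch checks above can be sketched as a small offline filter over exported suggestions. Everything here is an assumption for illustration: the `SuggestedMembership` shape, the numeric 0.0 to 1.0 confidence scale, and both thresholds are hypothetical and may not match what the page actually exposes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestedMembership:
    """Hypothetical record for one derived dataset suggestion."""

    item_id: str
    dataset: str
    confidence: float  # assumed numeric scale from 0.0 to 1.0


def noise_watch(memberships, low_confidence=0.5, max_datasets=3):
    """Flag weak matches and items suggested for too many datasets."""
    datasets_by_item = defaultdict(set)
    for m in memberships:
        datasets_by_item[m.item_id].add(m.dataset)

    # Weak: any suggestion below the assumed confidence threshold.
    weak = [m for m in memberships if m.confidence < low_confidence]

    # Overmatched: items suggested for more distinct datasets than allowed.
    overmatched = sorted(
        item for item, datasets in datasets_by_item.items()
        if len(datasets) > max_datasets
    )
    return {"weak": weak, "overmatched": overmatched}
```

The output is a list of things to write down for product or engineering. It is still derived evidence and does not make any membership final.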

What not to assume:

Do not assume dataset membership is final.
Do not assume queue routing is final.
Do not assume a human saved a label.
Do not use this page as Behavioral Observatory.

C. Behavioral Observatory

Use Behavioral Observatory when you need to review conversations and save behavioral labels. Route:

/behavioral-observatory

Before using it:

Confirm your role and assignment allow access.
Confirm the data shown is from the scoped run context you intend to review.
Notice whether the selected conversation is showing a derived suggestion or a saved label.

Select a conversation:

Open the labeling queue.
Pick one conversation.
Read the Human message first.
Read Lucia’s response second.

Set Intent:

Choose what the human was trying to do.
Use Other only when the listed categories do not fit.

Set Guest Affect:

Choose the smallest truthful emotional read.
Do not dramatize the guest’s state.

Set Response Strategy:

Choose Lucia’s main response move.
Pick the dominant strategy, not every strategy present.

Set Humanness:

= Template
= Functional
= Warm + Specific

Do not use humanness as a substitute for truth, usefulness, or safety.

Add notes:

Add a note only when it preserves useful behavioral evidence.
Keep it short.
Name the behavior and why it matters.

Save label:

Click Save label or Save updates.
Wait for the saved state.
If the state is error, do not count the label as persisted.

Refresh/check saved status if needed:

Refresh the page.
Confirm the label reloads.
If it does not reload, treat the label as not verified.
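
The save-then-reload rule above can be expressed as a small check. This is a sketch under stated assumptions: the `LabelStore` interface, the `"saved"` state string, and the in-memory stand-in are all hypothetical and do not describe the real Eval Labs or Supabase API.

```python
from typing import Optional, Protocol


class LabelStore(Protocol):
    """Hypothetical interface standing in for the real persistence layer."""

    def save(self, conversation_id: str, label: dict) -> str: ...
    def load(self, conversation_id: str) -> Optional[dict]: ...


def save_and_verify(store: LabelStore, conversation_id: str, label: dict) -> str:
    """Count a label as persisted only if it saves and then reloads."""
    state = store.save(conversation_id, label)
    if state != "saved":
        # An error state means the label must not be counted as persisted.
        return "not persisted"
    if store.load(conversation_id) != label:
        # Saved state was reported, but the label did not reload.
        return "not verified"
    return "persisted"


class InMemoryStore:
    """Toy store used only to exercise the check."""

    def __init__(self):
        self._rows = {}

    def save(self, conversation_id, label):
        self._rows[conversation_id] = dict(label)
        return "saved"

    def load(self, conversation_id):
        return self._rows.get(conversation_id)
```

With the toy store, `save_and_verify(InMemoryStore(), "c1", {"intent": "Other"})` returns `"persisted"`. A store that reports an error state, or that fails to return the label on reload, yields one of the other two outcomes.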

D. Guest Facing Agent Verification

Use Guest Facing Agent Verification when you need to run or inspect booked-guest verification behavior. Routes:

/guest-facing/verification
/guest-facing/verification/results

Verification is for:

running the scenario pack from the app surface
inspecting pass/fail results
reviewing failure details
exporting or copying verification summaries

Verification is not:

a tester lane
Team Review
Global Analysis
proof that Lucia is human-approved

E. Team Review

Use Team Review when owner/admin needs oversight of evaluator activity and review quality. Route:

/team-review

Team Review is for:

inspecting evaluator activity
finding missing checks
spotting flags and failures
reviewing recent human-evaluation signal
deciding what needs owner/admin attention

Team Review is not:

evaluator productivity tracking for its own sake
tester onboarding
a human approval page

F. Global Analysis

Use Global Analysis when you need read-only behavioral and analytics evidence. Route:

/analysis

Global Analysis is for:

inspecting completed run evidence
reading behavioral summaries
comparing patterns
opening Single Run Analysis when available

Global Analysis is not:

a human approval page
a Behavioral Observatory label-save workflow
Registry Diagnostics

G. Run History

Use Run History when you need the run ledger. Route:

/lucia/automated/runs

Run History is for:

finding completed/finalized runs
checking run lifecycle truth
copying run/session identifiers
opening review or analysis routes when allowed

Run History is not:

proof that a response was good
proof that Behavioral Observatory labels exist
proof that Lucia is human-approved

Operator rule

Use the right surface for the right truth:

Run History = run ledger truth
Team Review = owner/admin oversight truth
Global Analysis = read-only evidence truth
Registry Diagnostics = derived classification truth
Behavioral Observatory = saved behavioral label truth
Review Queue = prompt/item review workflow
Human reviewers = Lucia quality judgment
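
For anyone scripting checks around this rule, the mapping above can be kept as a lookup table. This is a minimal sketch: the names `SURFACE_TRUTH`, `truth_of`, and `can_support_quality_claim` are hypothetical, and the entries are copied from the list above.

```python
# Surface name -> the kind of truth it provides, copied from the operator rule.
SURFACE_TRUTH = {
    "Run History": "run ledger truth",
    "Team Review": "owner/admin oversight truth",
    "Global Analysis": "read-only evidence truth",
    "Registry Diagnostics": "derived classification truth",
    "Behavioral Observatory": "saved behavioral label truth",
    "Review Queue": "prompt/item review workflow",
    "Human reviewers": "Lucia quality judgment",
}


def truth_of(surface: str) -> str:
    """Return the truth a surface provides; unknown names raise KeyError."""
    return SURFACE_TRUTH[surface]


def can_support_quality_claim(surface: str) -> bool:
    """Only human review supplies Lucia quality judgment in this guide."""
    return truth_of(surface) == "Lucia quality judgment"
```

For example, `can_support_quality_claim("Global Analysis")` is `False`, which matches the guide's statement that Global Analysis is not a human approval page.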
