Eval Labs now has separate surfaces for owner/admin oversight, evaluator workbench testing, tester prompt-testing, guest-facing verification, controlled platform checks, derived Registry Diagnostics, persisted Behavioral Observatory labels, Run History, Team Review, Global Analysis, Single Run Analysis, and Review Queue work.

Route map

Route | Surface | Current access intent
/ | Role-aware home / landing | Owner/admin/evaluator/tester by role
/lucia/launcher | Launcher / workspace chooser | Owner/admin/evaluator/tester
/lucia/custom | Custom Prompt Test | Owner/admin/evaluator/tester
/lucia/custom/suites/:suiteId | Custom saved suite deep link | Owner/admin
/lucia/auto-generated | Auto-generated Prompt Test | Owner/admin/evaluator/tester
/lucia/automated | Legacy alias to auto-generated tester | Owner/admin/evaluator/tester
/guest-facing/verification | Guest Facing Agent Verification Check | Owner/admin/evaluator
/guest-facing/verification?view=check | Guest Facing Agent Verification Check | Owner/admin/evaluator
/guest-facing/verification/results | Verification Results | Owner/admin/evaluator
/lucia/batch-runner | Controlled Batch Runner | Owner/admin/evaluator
/lucia/automated/runs | Run History | Owner/admin global; evaluator/tester own or scoped runs
/team-review | Team Review overview | Owner/admin
/team-review/evaluators/:evaluatorKey | Team Review evaluator detail | Owner/admin
/registry-diagnostics | Registry Diagnostics | Owner/admin
/dataset-diagnostics | Legacy Registry Diagnostics alias | Owner/admin
/behavioral-observatory | Behavioral Observatory | Owner/admin
/analysis | Global Analysis | Owner/admin
/experiments | Legacy Global Analysis alias | Owner/admin
/analysis/runs/:sessionId | Single Run Analysis | Owner/admin
/runs/:sessionId/running | In-flight run route | Owner/admin; evaluator/tester only for own scoped runs
/runs/:sessionId/review | Review Queue | Owner/admin; evaluator/tester only for own scoped runs
/runs/:sessionId/review?eval=:caseId | Direct eval-item review link | Owner/admin; evaluator/tester only for own scoped runs
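
The access intent in the route map can be read as a lookup from route pattern to allowed roles. The following is a minimal TypeScript sketch of that reading, not the app's actual guard: `Role`, `AccessRule`, `ROUTE_ACCESS`, and `canAccess` are hypothetical names, only four routes are shown, and the "own scoped runs" condition is reduced to a boolean flag.

```typescript
// Hypothetical names throughout; an illustration of the route map, not the real guard.
type Role = "owner" | "admin" | "evaluator" | "tester";

interface AccessRule {
  // Roles that may open the surface regardless of who owns the run.
  full: Role[];
  // Roles that may open the surface only for their own scoped runs.
  ownScopedOnly: Role[];
}

const ROUTE_ACCESS: Record<string, AccessRule> = {
  "/lucia/custom": { full: ["owner", "admin", "evaluator", "tester"], ownScopedOnly: [] },
  "/guest-facing/verification": { full: ["owner", "admin", "evaluator"], ownScopedOnly: [] },
  "/team-review": { full: ["owner", "admin"], ownScopedOnly: [] },
  "/runs/:sessionId/review": { full: ["owner", "admin"], ownScopedOnly: ["evaluator", "tester"] },
};

function canAccess(routePattern: string, role: Role, ownsRun: boolean = false): boolean {
  const rule = ROUTE_ACCESS[routePattern];
  if (!rule) return false; // unknown routes are denied in this sketch
  if (rule.full.indexOf(role) !== -1) return true;
  return ownsRun && rule.ownScopedOnly.indexOf(role) !== -1;
}
```

Under these assumptions, `canAccess("/team-review", "evaluator")` is false, and `canAccess("/runs/:sessionId/review", "tester", true)` is true only because the tester owns the run.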

Surface definitions

Role-aware home

The home route is role-aware. Owner/admin see the privileged platform overview. Evaluator and tester users should see onboarding/workspace entry points appropriate to their role.

Launcher

The Launcher is the workspace chooser. It separates:
  • Custom Prompt Test
  • Auto-generated Prompt Test
  • Guest Facing Agent Verification Check
  • Controlled Batch Runner
The top app shell owns page identity. The older in-page blog-style masthead pattern has been removed from the product surface.

Custom prompt tester

The Custom Prompt Test lets a user enter 1-10 exact prompts. Use it for targeted testing, evaluator work, and repeatable behavior-family review. Owner/admin, evaluator, and tester roles can use this surface.
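
The 1-10 limit can be checked before a run starts. The sketch below is one possible check, assuming a hypothetical helper `checkCustomPrompts`; only the count limit comes from the text above, and the rule that a whitespace-only entry is rejected is an added assumption.

```typescript
// Hypothetical helper; the 1-10 limit is from the text, everything else is assumed.
const MIN_PROMPTS = 1;
const MAX_PROMPTS = 10;

type PromptCheck =
  | { ok: true; prompts: string[] }
  | { ok: false; reason: string };

function checkCustomPrompts(prompts: string[]): PromptCheck {
  if (prompts.length < MIN_PROMPTS || prompts.length > MAX_PROMPTS) {
    return {
      ok: false,
      reason: `expected ${MIN_PROMPTS}-${MAX_PROMPTS} prompts, got ${prompts.length}`,
    };
  }
  // Assumption: a whitespace-only entry is not a usable prompt.
  for (let i = 0; i < prompts.length; i++) {
    if (prompts[i].trim().length === 0) {
      return { ok: false, reason: `prompt ${i + 1} is empty` };
    }
  }
  // Prompts are returned unchanged, since the surface is described as using exact prompts.
  return { ok: true, prompts };
}
```

The check deliberately does not trim or normalize accepted prompts, so the text that reaches the run is the text the user entered.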

Auto-generated prompt tester

The Auto-generated Prompt Test runs the normal generated 50-prompt test. Owner/admin, evaluator, and tester roles can use this surface. It is separate from controlled batch infrastructure.

Guest Facing Agent Verification Check

Guest Facing Agent Verification Check runs the booked-guest verification scenario pack through the app surface. Owner/admin and evaluator roles can use this surface; the tester role cannot.

Verification Results

Verification Results shows the saved/current Guest Facing Agent verification output, scenario failures, exports, and copied summaries. Owner/admin and evaluator roles can use this surface; the tester role cannot.

Controlled batch runner

The Controlled Batch Runner is controlled platform-readiness tooling. It supports:
  • 1-run smoke
  • 3-run checkpoint
  • 10-run checkpoint
It was used for the 60-run readiness gate. In the current role model, owner/admin and evaluator roles can use it; the tester role cannot.

Run History

Run History is the scoped run ledger. It includes completed/finalized run truth and may show scoped operational state. Owner/admin can inspect shared/global persisted evidence. Evaluator and tester access is scoped to their own allowed work.
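
The scoping rule can be pictured as a filter over run rows. This is a simplified sketch, assuming hypothetical `Viewer` and `RunRow` shapes and treating "own allowed work" as a plain match on an assumed `ownerId` field; any broader scoping rule the app applies is not covered here.

```typescript
// Hypothetical shapes; the real run records and scoping rules are not shown in this document.
type Role = "owner" | "admin" | "evaluator" | "tester";

interface Viewer {
  id: string;
  role: Role;
}

interface RunRow {
  sessionId: string;
  ownerId: string; // assumed field identifying who started the run
}

function visibleRuns(runs: RunRow[], viewer: Viewer): RunRow[] {
  // Owner/admin inspect shared/global persisted evidence.
  if (viewer.role === "owner" || viewer.role === "admin") return runs;
  // Evaluator and tester see only their own runs in this sketch.
  return runs.filter((run) => run.ownerId === viewer.id);
}
```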

Team Review

Team Review is the owner/admin oversight surface. It groups evaluator activity, flags review gaps, and helps owner/admin decide where human evaluation signal needs inspection. It is not available to evaluator, tester, or unassigned users.

Registry Diagnostics

Registry Diagnostics is the read-only diagnostic surface for the Dataset Registry and Human Review Queue 2.0 classification model. It shows derived suggestions from existing Eval Labs run/review data:
  • dataset membership suggestions
  • confidence
  • source fields
  • queue lane suggestions
  • Noise / Watch classification patterns
It does not create labels, save human behavioral decisions, or prove final dataset membership. /registry-diagnostics is canonical. /dataset-diagnostics remains a legacy inbound alias.

Behavioral Observatory

Behavioral Observatory is the first-class behavioral labeling surface. It lets a reviewer inspect conversations and save structured labels:
  • Intent
  • Guest Affect
  • Response Strategy
  • Humanness
  • Notes
A label counts as persisted Behavioral Observatory evidence only after Supabase confirms the save. Derived suggestions on this page are starting context, not final judgment.
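
The save-confirmation rule can be sketched as a small state check. The names below (`BehavioralLabel`, `LabelStatus`, `statusAfterSave`, `countsAsEvidence`) are hypothetical, and the label fields are typed as plain strings only for illustration. `SaveResult` is a local stand-in that mirrors the `error` field supabase-js responses carry (null on success); no real Supabase call is made here.

```typescript
// Hypothetical types; field value types are assumed to be free text for this sketch.
interface BehavioralLabel {
  intent: string;
  guestAffect: string;
  responseStrategy: string;
  humanness: string;
  notes: string;
}

type LabelStatus = "draft" | "saving" | "saved" | "save_failed";

// Local stand-in for a save response: error is null when the write succeeded.
interface SaveResult {
  error: { message: string } | null;
}

function statusAfterSave(result: SaveResult): LabelStatus {
  return result.error === null ? "saved" : "save_failed";
}

function countsAsEvidence(status: LabelStatus): boolean {
  // Drafts, in-flight saves, and failed saves are not persisted evidence.
  return status === "saved";
}
```

Keeping "saving" and "save_failed" as separate states means a label that was submitted but never confirmed is not displayed as saved.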

Global Analysis

Global Analysis is the read-only owner/admin behavioral and analytics surface. It is AI-analyzed platform evidence, not human Lucia-quality approval. /analysis is canonical. /experiments remains a legacy alias.
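
The route map lists three legacy aliases: /lucia/automated, /dataset-diagnostics, and /experiments. A minimal sketch of resolving them to their canonical paths follows, assuming hypothetical names `LEGACY_ALIASES` and `canonicalPath`. Whether the app redirects or renders the canonical surface in place is not stated in this document.

```typescript
// Hypothetical lookup; alias and canonical paths are taken from the route map.
const LEGACY_ALIASES: Record<string, string> = {
  "/lucia/automated": "/lucia/auto-generated",
  "/dataset-diagnostics": "/registry-diagnostics",
  "/experiments": "/analysis",
};

function canonicalPath(pathname: string): string {
  // Exact match only: /lucia/automated/runs is Run History, not an alias.
  if (Object.prototype.hasOwnProperty.call(LEGACY_ALIASES, pathname)) {
    return LEGACY_ALIASES[pathname];
  }
  return pathname;
}
```

Matching on the exact path rather than a prefix matters here, because /lucia/automated/runs shares a prefix with an alias but is its own surface.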

Single Run Analysis

Single Run Analysis is read-only analysis of one completed run/session. It includes run metadata, behavioral summaries, item rows, and copy/deep-link controls when hydrated data is available.

Review Queue

The Review Queue is the scoring and review workflow for prompts/items. In the current role model, evaluator and tester review access is scoped to their own allowed work. Owner/admin can inspect shared persisted evidence where oversight applies.

Copy controls

Copy Session ID and Copy Deep Link controls exist across key surfaces, including run rows, controlled batch summaries, Single Run Analysis, and review/item contexts. These controls are addressability infrastructure. They make future debugging, review handoff, and Canon recovery easier.
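
A deep link of this kind can be built from the route patterns in the route map. The sketch below assumes hypothetical helpers `reviewDeepLink` and `analysisDeepLink` and a placeholder origin; it shows the URL shape only, not how the Copy Deep Link control is implemented.

```typescript
// Hypothetical helpers; path shapes follow /runs/:sessionId/review and /analysis/runs/:sessionId.
function stripTrailingSlashes(origin: string): string {
  return origin.replace(/\/+$/, "");
}

function reviewDeepLink(origin: string, sessionId: string, caseId?: string): string {
  const base = `${stripTrailingSlashes(origin)}/runs/${encodeURIComponent(sessionId)}/review`;
  // The eval query parameter targets one item in the Review Queue.
  return caseId === undefined ? base : `${base}?eval=${encodeURIComponent(caseId)}`;
}

function analysisDeepLink(origin: string, sessionId: string): string {
  return `${stripTrailingSlashes(origin)}/analysis/runs/${encodeURIComponent(sessionId)}`;
}
```

For example, with the placeholder origin `https://evals.example`, `reviewDeepLink("https://evals.example", "abc123", "case-7")` yields `https://evals.example/runs/abc123/review?eval=case-7`.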

Surface distinction rule

Use this map:
Run History = run ledger truth
Team Review = owner/admin oversight truth
Global Analysis = read-only platform evidence truth
Registry Diagnostics = derived classification truth
Behavioral Observatory = saved behavioral label truth
Review Queue = prompt/item review workflow
Do not claim a diagnostic suggestion is a saved label.