Eval Labs is a separate role-based human evaluation platform that tests Lucia through the deployed Engine, stores review evidence in Supabase, and exposes scoped testing, run-history, review, Team Review, and Global Analysis surfaces.

High-level architecture

Employee / Reviewer
→ Eval Labs web app
→ Clerk role and Supabase RLS scope
→ Lucia Engine /admin/operator-focus
→ Lucia response
→ Eval Labs Review Queue
→ Suggested selections plus human review
→ Quick Review / Human Guidance Evaluation
→ Lifecycle finalization
→ Supabase persistence
→ Run History / Team Review / Global Analysis / Exports

Runtime responsibility split

Eval Labs owns:
  • top app shell and route identity
  • test launcher UX
  • custom prompt suite UX
  • auto-generated prompt tester UX
  • guest-facing verification check and results UX
  • controlled batch runner UX
  • run orchestration
  • Run History
  • Team Review
  • Global Analysis
  • Single Run Analysis
  • copy Session ID / copy Deep Link controls
  • role-gated product access
  • Clerk public metadata role behavior
  • Supabase RLS role-claim requirements
  • Review Queue
  • suggested review generation
  • human ratings
  • semantic scoring sliders
  • Quick Review
  • Human Guidance Evaluation
  • review lifecycle and finalization
  • dirty / completion state
  • tester identity capture
  • exports
  • Supabase persistence for eval data
  • staged hydration from run summaries to recent/deep evidence
  • localStorage compaction for completed cloud-backed runs
Lucia Engine owns:
  • actual Lucia behavior
  • intent/routing
  • response generation
  • emotional containment
  • operational prioritization
  • model gateway behavior
Eval Labs does not decide Lucia’s response quality; it records and evaluates the response Lucia produced. AI-reviewed platform evidence proves only that the Eval Labs lifecycle works. Human reviewers still decide Lucia’s behavioral quality.

Current Engine target

Eval Labs endpoint selection is environment-configured through VITE_LUCIA_EVAL_ENDPOINT. The current Lucia v0.1.3.6 validation target for active dev refinement is:
https://api-dev.hellolucia.ai/admin/operator-focus
Development is where active iteration happens. Staging is for promoted validation only when intentionally configured.
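As a minimal sketch of environment-driven endpoint selection, the helper below reads VITE_LUCIA_EVAL_ENDPOINT from an env-like object. The function name, the `EvalEnv` type, and the choice to throw on a missing or non-HTTPS value are assumptions made for illustration; they are not taken from the Eval Labs codebase.

```typescript
// Hypothetical helper: the name, type, and error behavior are illustrative assumptions.
type EvalEnv = Record<string, string | undefined>;

function resolveEvalEndpoint(env: EvalEnv): string {
  const raw = (env.VITE_LUCIA_EVAL_ENDPOINT ?? "").trim();
  if (raw === "") {
    // Assumption: refuse to run rather than silently fall back to some default target.
    throw new Error("VITE_LUCIA_EVAL_ENDPOINT is not set");
  }
  if (!raw.startsWith("https://")) {
    // Assumption: only HTTPS Engine targets are acceptable.
    throw new Error("VITE_LUCIA_EVAL_ENDPOINT must be an https:// URL");
  }
  return raw;
}

// Example with the dev target named in this section.
const devEndpoint = resolveEvalEndpoint({
  VITE_LUCIA_EVAL_ENDPOINT: "https://api-dev.hellolucia.ai/admin/operator-focus",
});
```

In a Vite app, the argument would typically be `import.meta.env`. Failing loudly on a missing value keeps the Browser Network request URL, the first item in the debugging hierarchy, easy to reason about.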

Source of truth hierarchy

When debugging Eval Labs platform behavior:
  1. Browser Network request URL
  2. Current route and role state
  3. Supabase rows and counts
  4. Run History / Analysis UI truth
  5. localStorage diagnostics
  6. Render service environment
  7. Netlify environment variables
  8. Lucia Engine deployed commit
  9. Eval Labs deployed commit
  10. Exported run metadata
  11. Human memory
Human memory is useful. It is not the source of truth.

Current route architecture

The current route map is documented in Product Surfaces and Route Map. Core canonical paths:
/                           Owner/Admin Home dashboard
/lucia/launcher             workspace chooser
/lucia/custom               Custom prompt tester
/lucia/auto-generated       Auto-generated 50-prompt tester
/guest-facing/verification  Guest Facing Agent Verification Check
/lucia/batch-runner         Controlled Batch Runner
/lucia/automated/runs       Run History
/team-review                Team Review
/analysis                   Global Analysis
/analysis/runs/:sessionId   Single Run Analysis
/runs/:sessionId/review     Review Queue
Legacy aliases:
/lucia/automated            alias to /lucia/auto-generated
/experiments                alias to /analysis
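The alias rules above can be sketched as an exact-match lookup. The paths come from the route map in this section; the helper names (`resolveCanonicalPath`, `reviewQueuePath`) are hypothetical and not part of the Eval Labs router.

```typescript
// Legacy alias -> canonical path, as listed in the route map.
const LEGACY_ALIASES = new Map<string, string>([
  ["/lucia/automated", "/lucia/auto-generated"],
  ["/experiments", "/analysis"],
]);

// Hypothetical helper: exact-match lookup, so /lucia/automated/runs
// (Run History) is not rewritten by the /lucia/automated alias.
function resolveCanonicalPath(path: string): string {
  return LEGACY_ALIASES.get(path) ?? path;
}

// Hypothetical helper: builds the Review Queue path for a session.
function reviewQueuePath(sessionId: string): string {
  return `/runs/${encodeURIComponent(sessionId)}/review`;
}
```

Exact matching matters here because a prefix-based rewrite would wrongly turn the canonical Run History path into /lucia/auto-generated/runs.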

Current role architecture

Role gating is documented in Role and Access Model. Current supported Clerk public metadata values:
owner
admin
evaluator
tester
Owner/admin are privileged roles with full platform access, Team Review, Global Analysis, shared persisted evidence, and all test surfaces.
Evaluator is the full evaluator workbench role. Evaluator can use evaluator-safe test surfaces and own run/review/history routes, but cannot use Team Review or Global Analysis.
Tester is the entry-level prompt-testing lane. Tester can use Custom Prompt Test and Auto-generated Prompt Test, but not verification, controlled batch, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.
Missing or unknown role metadata should fail closed. Frontend role behavior is driven by Clerk public metadata. Persisted evidence access depends on the Clerk session token carrying eval_labs_role so Supabase RLS can recognize privileged owner/admin access.
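The fail-closed role rules can be sketched as a small lookup. This is an illustrative model, not the Eval Labs implementation: the `Surface` names and the `canAccess` helper are hypothetical, and the sketch covers only surfaces this section is explicit about. The text does not say whether evaluator can use the controlled batch runner, so the sketch leaves it denied as an assumption.

```typescript
type Role = "owner" | "admin" | "evaluator" | "tester";

// Hypothetical surface identifiers for illustration only.
type Surface =
  | "customPromptTest"
  | "autoGeneratedPromptTest"
  | "verification"
  | "controlledBatch"
  | "teamReview"
  | "globalAnalysis";

const ROLES: readonly Role[] = ["owner", "admin", "evaluator", "tester"];

const ALL_SURFACES: readonly Surface[] = [
  "customPromptTest",
  "autoGeneratedPromptTest",
  "verification",
  "controlledBatch",
  "teamReview",
  "globalAnalysis",
];

const ALLOWED: Record<Role, readonly Surface[]> = {
  owner: ALL_SURFACES,
  admin: ALL_SURFACES,
  // Assumption: controlledBatch is omitted because the source does not state it.
  evaluator: ["customPromptTest", "autoGeneratedPromptTest", "verification"],
  tester: ["customPromptTest", "autoGeneratedPromptTest"],
};

// Returns a known role, or null for missing/unknown metadata values.
function parseRole(value: unknown): Role | null {
  return ROLES.find((r) => r === value) ?? null;
}

// Fail closed: anything that is not a recognized role gets no access.
function canAccess(rawRole: unknown, surface: Surface): boolean {
  const role = parseRole(rawRole);
  if (role === null) return false;
  return ALLOWED[role].indexOf(surface) !== -1;
}
```

A check like this only gates the frontend. Access to persisted evidence is still decided by Supabase RLS from the eval_labs_role claim in the Clerk session token.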

Important design decision

The custom prompt feature did not require a separate database model because Eval Labs already had a general structure:
Session → Run items → Lucia responses → Human reviews
Custom prompts are a new run source, not a new evaluation universe. That is good architecture.

Current run source strategy

Custom runs use:
mode: automated
runSource: custom
category: custom/prompts
templateKey: custom-prompt
This preserves compatibility with the existing run engine while clearly distinguishing custom runs from generated automated runs.
Controlled batch runs use the same platform lifecycle to create, execute, review, finalize, persist, and verify runs. They are operational readiness evidence, not a separate human-review standard.
Guest Facing Agent Verification Check is a separate app surface for booked-guest verification behavior and results. It is evaluator-safe but not tester-facing.
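A minimal sketch of how the custom run fields might be typed and checked. The four field values are the ones listed above; the `RunDescriptor` interface and the `isCustomRun` helper are hypothetical names, and the real run record likely carries more fields.

```typescript
// Hypothetical shape holding only the four fields named in this section.
interface RunDescriptor {
  mode: string;
  runSource: string;
  category: string;
  templateKey: string;
}

// Values as documented for custom runs.
const CUSTOM_RUN: RunDescriptor = {
  mode: "automated",
  runSource: "custom",
  category: "custom/prompts",
  templateKey: "custom-prompt",
};

// A custom run keeps mode "automated" for run-engine compatibility,
// so runSource is the field that tells it apart from generated runs.
function isCustomRun(run: Pick<RunDescriptor, "mode" | "runSource">): boolean {
  return run.mode === "automated" && run.runSource === "custom";
}
```

Keying the distinction on runSource rather than mode is what lets custom runs reuse the existing automated run engine unchanged.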

Review-layer architecture

Eval Labs now separates review responsibility into layers:
Review Queue UI
Suggested review values
Employee Review fields
Human Guidance Evaluation
Review State / Escalation flags
Lifecycle / dirty / completion state
Adjudication metadata
Exports / Analysis
The schema supports high-resolution analysis while the employee UI remains simple. This is intentional. The user-facing review experience should remain calm and guided even when the exported data is detailed.