Eval Labs is Lucia’s role-based human evaluation and platform-readiness infrastructure. It lets authorized users test, score, annotate, analyze, compare, oversee, and improve AI behavior over time.

Definition

Eval Labs is the internal product and workflow used to evaluate Lucia’s responses with role-based human judgment. It is not simply a prompt runner. It is first-class Lucia intelligence infrastructure: the place where Lucia’s behavior is tested, inspected, reviewed, analyzed, and improved. Eval Labs captures:
  • the prompt
  • Lucia’s response
  • the run source
  • the suite context
  • the human evaluator
  • ratings
  • suggested review signal
  • quick employee review
  • Human Guidance Evaluation
  • pass/fail state
  • written notes
  • review lifecycle
  • adjudication metadata
  • role-gated access state
  • Run History evidence
  • Analysis and Single Run Analysis evidence
  • Registry Diagnostics evidence
  • Dataset Registry suggestion evidence
  • Human Review Queue 2.0 lane suggestion evidence
  • Behavioral Observatory labels
  • exports for analysis
  • dirty / completion state
  • behavior patterns over time
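
As a rough illustration of how a few of the captured fields above might sit together on one reviewed item, here is a minimal TypeScript sketch. Every field name, the `ReviewedItem` type, and the `isComplete` rule are hypothetical and are not the actual Eval Labs schema. In particular, treating "dirty" as "has unsaved changes" is an assumption.

```typescript
// Hypothetical shape for one reviewed item. Field names are illustrative only.
type RunSource = "custom" | "automated" | "manual";

interface ReviewedItem {
  prompt: string;
  response: string;                 // Lucia's response
  runSource: RunSource;             // the run source tags named in this document
  suiteId: string | null;           // suite context, if any
  evaluatorId: string;              // the human evaluator
  ratings: Record<string, number>;  // rating name -> score
  passed: boolean | null;           // null until a pass/fail state is recorded
  notes: string;                    // written notes
  dirty: boolean;                   // assumption: true while changes are unsaved
}

// Assumed rule: an item is complete once pass/fail is set and nothing is unsaved.
function isComplete(item: ReviewedItem): boolean {
  return item.passed !== null && !item.dirty;
}

const exampleItem: ReviewedItem = {
  prompt: "A guest says the room is too cold. What should I do first?",
  response: "(Lucia's response text)",
  runSource: "custom",
  suiteId: null,
  evaluatorId: "evaluator-1",
  ratings: { calm: 8, trust: 7 },
  passed: true,
  notes: "",
  dirty: false,
};
```

The real system captures more than this sketch shows, including lifecycle, adjudication, and evidence fields.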

Why Eval Labs exists

Generic AI benchmarks are not enough for Lucia. Lucia is not being built to win trivia tests. Lucia is being built to help hospitality operators stay oriented, make good decisions, and trust the system under real pressure. That means we must evaluate qualities that normal benchmarks miss:
  • calm
  • warmth
  • trust
  • intent accuracy
  • operational usefulness
  • containment
  • truth-state discipline
  • operator cognitive load

The product principle

Eval Labs exists because Lucia cannot become excellent by vibes. Lucia needs repeated, inspectable, human-scored evaluation against real behavioral expectations. The product also needs controlled readiness evidence that proves the evaluation platform itself can create runs, capture responses, persist reviews, finalize sessions, hydrate Run History, hydrate Analysis, and keep local client state compact. The system must help us answer:
  • Did Lucia understand the user?
  • Did Lucia choose the right mode?
  • Did Lucia say too much or too little?
  • Did Lucia create calm or noise?
  • Did Lucia preserve trust?
  • Did Lucia overclaim?
  • Did Lucia give the right next move?

What Eval Labs does today

Eval Labs currently supports:
  • custom prompt tests of 1–10 prompts
  • saved custom prompt suites
  • auto-generated prompt tests
  • Guest Facing Agent Verification Check and Verification Results
  • Controlled Batch Runner readiness checks
  • shared Review Queue scoring
  • suggested selections
  • semantic scoring sliders
  • Quick Review
  • Human Guidance Evaluation
  • lifecycle finalization
  • Run History
  • Team Review for owner/admin oversight
  • Global Analysis
  • Single Run Analysis
  • Registry Diagnostics for derived Dataset Registry and Human Review Queue 2.0 inspection
  • Behavioral Observatory for saved reviewer behavioral labels
  • copy Session ID / copy Deep Link controls across key surfaces
  • Clerk role gating for owner, admin, evaluator, and tester
  • Supabase RLS protection for persisted evidence when the Clerk session carries eval_labs_role
  • tester identity capture through Clerk
  • JSON / CSV / Markdown exports
  • run source tagging: custom, automated, manual
  • Supabase persistence for suites, runs, items, and reviews
  • live testing against the active dev Lucia Engine
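
The role gating above depends on the session carrying an eval_labs_role value. The sketch below shows one cautious way to read that value, assuming the claims have already been decoded into a flat object. The `roleFromClaims` helper and the flat claims shape are assumptions for illustration, not the Clerk API or the actual Eval Labs code.

```typescript
// The four roles named in this document.
type EvalLabsRole = "owner" | "admin" | "evaluator" | "tester";

const KNOWN_ROLES: readonly string[] = ["owner", "admin", "evaluator", "tester"];

// Hypothetical helper: returns the role only when the claim is present and
// is one of the known roles. Anything else yields null.
function roleFromClaims(claims: Record<string, unknown>): EvalLabsRole | null {
  const raw = claims["eval_labs_role"];
  if (typeof raw !== "string") return null;
  return KNOWN_ROLES.includes(raw) ? (raw as EvalLabsRole) : null;
}
```

Returning null for a missing or unknown value lets callers fail closed instead of guessing a default role.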

What Eval Labs is not

Eval Labs is not:
  • a generic chat app
  • a one-off prompt playground
  • a rubber-stamp review form
  • a dataset label factory
  • a vague behavior dashboard
  • a replacement for product judgment
  • a replacement for Lucia doctrine
  • a claim that Lucia is human-approved
  • a complete backend authorization boundary by itself
Eval Labs is Lucia-native behavioral judgment infrastructure.

Who uses Eval Labs

Eval Labs is designed for:
  • founders
  • owners/admins running platform and quality gates
  • Lucia evaluators
  • testers in the entry-level prompt-testing lane
  • employees testing approved prompt workflows
  • engineers validating behavior changes
  • product leads reviewing quality patterns
  • future QA and evaluation teams
Owners and admins have full platform access, including shared persisted evidence, Team Review, Global Analysis, and all current test surfaces.
Evaluator is the full evaluator workbench role. Evaluators can use evaluator-safe test types and their own run, review, and history routes, but they do not see Team Review or Global Analysis.
Tester is the narrower onboarding role. Testers can use Custom Prompt Test and Auto-generated Prompt Test, but they cannot use verification, controlled batch, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.
For the canonical matrix, see Eval Labs Roles and Access Matrix.
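
The role rules described above can be sketched as a small lookup table. This is a minimal illustration in TypeScript, not the canonical matrix: the surface identifiers, the `Access` states, and `canAccess` are hypothetical names. Where this page does not say whether evaluators can reach a surface, the sketch records "unspecified" and treats it as not allowed.

```typescript
type Role = "owner" | "admin" | "evaluator" | "tester";

// Hypothetical identifiers for surfaces named in this document.
type Surface =
  | "customPromptTest"
  | "autoGeneratedPromptTest"
  | "verificationCheck"
  | "controlledBatch"
  | "teamReview"
  | "globalAnalysis"
  | "registryDiagnostics"
  | "behavioralObservatory";

type Access = "allow" | "deny" | "unspecified";

const ALL_SURFACES: Surface[] = [
  "customPromptTest", "autoGeneratedPromptTest", "verificationCheck",
  "controlledBatch", "teamReview", "globalAnalysis",
  "registryDiagnostics", "behavioralObservatory",
];

// Build a row with the same access value for every surface.
function fill(access: Access): Record<Surface, Access> {
  const row = {} as Record<Surface, Access>;
  for (const s of ALL_SURFACES) row[s] = access;
  return row;
}

const MATRIX: Record<Role, Record<Surface, Access>> = {
  owner: fill("allow"),
  admin: fill("allow"),
  // Only the two denials are stated on this page. The rest is left unspecified.
  evaluator: { ...fill("unspecified"), teamReview: "deny", globalAnalysis: "deny" },
  // Testers get the two prompt tests and nothing else.
  tester: { ...fill("deny"), customPromptTest: "allow", autoGeneratedPromptTest: "allow" },
};

// Deny by default: only an explicit "allow" grants access.
function canAccess(role: Role, surface: Surface): boolean {
  return MATRIX[role][surface] === "allow";
}
```

A client-side check like this only controls what the UI shows. It is not a backend authorization boundary, which is why persisted evidence is also protected server-side.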

Why saved suites matter

Saved custom prompt suites changed Eval Labs from a testing slot machine into a regression lab. Before saved suites, reviewers could run broad generated tests. Now reviewers can repeatedly test the same exact prompts while Lucia’s brain, intent layer, wording, memory, and routing evolve. That repeatability is the foundation of serious improvement.

Current expanded definition

Eval Labs now captures multiple layers of review signal:
  • ratings
  • suggested review
  • quick employee review
  • human guidance evaluation
  • derived dataset membership suggestions
  • derived Human Review Queue 2.0 lane suggestions
  • persisted Behavioral Observatory labels
  • optional notes
  • review state
  • senior review routing
  • canon candidate signal
  • adjudication metadata
  • review lifecycle
  • dirty / completion state
  • exports for analysis
The core architectural distinctions are:
  • employee reaction ≠ canonical training label
  • derived suggestion ≠ saved label
Employees should provide structured reactions. Senior reviewers and adjudicators should assign canonical meaning. This protects Lucia from ontology drift while still allowing non-expert employees to participate in useful review work.
Registry Diagnostics belongs to the derived-suggestion layer. It helps the team inspect how existing Eval Labs evidence appears to map to datasets and queue lanes.
Behavioral Observatory belongs to the saved-label layer. It preserves intentional reviewer labels for intent, guest affect, response strategy, humanness, and notes when persistence succeeds.
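
One way to keep these layers from blurring is to give each its own type, so a reaction or a derived suggestion can never be passed where a canonical or saved label is expected. The following TypeScript sketch assumes hypothetical type names, fields, and a three-level `ReviewerLevel`. None of these come from the Eval Labs codebase.

```typescript
type ReviewerLevel = "employee" | "senior" | "adjudicator";

interface EmployeeReaction {
  kind: "reaction";
  itemId: string;
  reviewerId: string;
  reaction: string;          // structured reaction, not a training label
}

interface CanonicalLabel {
  kind: "canonical";
  itemId: string;
  label: string;
  assignedBy: string;
}

interface DerivedSuggestion {
  kind: "derived";           // computed from existing evidence, never persisted as a label
  itemId: string;
  suggestedDataset: string;
}

interface SavedLabel {
  kind: "saved";             // intentionally entered by a reviewer and persisted
  itemId: string;
  intent: string;
  notes: string;
}

// Only senior reviewers and adjudicators may turn a reaction into canonical meaning.
function assignCanonicalLabel(
  reaction: EmployeeReaction,
  label: string,
  reviewer: { id: string; level: ReviewerLevel },
): CanonicalLabel {
  if (reviewer.level === "employee") {
    throw new Error("employees record reactions; they do not assign canonical labels");
  }
  return { kind: "canonical", itemId: reaction.itemId, label, assignedBy: reviewer.id };
}

// Type guard separating the derived-suggestion layer from the saved-label layer.
function isSavedLabel(x: DerivedSuggestion | SavedLabel): x is SavedLabel {
  return x.kind === "saved";
}
```

The `kind` discriminator makes the layer explicit in every record, so exports and analysis code can filter on it instead of inferring the layer from context.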

Readiness doctrine

Eval Labs has now passed the AI-reviewed platform readiness gate:
  • 60 completed runs
  • 3,000 prompts
  • 3,000 eval_run_items
  • 3,000 Lucia responses
  • 3,000 reviews
That result proves the platform lifecycle can run end to end. It does not prove Lucia is ready for real operators. Human review remains the true Lucia behavioral-quality judgment layer. The current platform is implemented for controlled role-based human onboarding. Evaluator workspace polish and first-cohort guidance remain under active hardening.
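
The gate figures above are internally consistent: every prompt has one run item, one response, and one review, and 3,000 prompts over 60 runs averages 50 prompts per run. That average is derived here by arithmetic and is not stated as a per-run setting. A minimal TypeScript sketch of the parity check, with hypothetical names, could look like this:

```typescript
interface ReadinessCounts {
  completedRuns: number;
  prompts: number;
  runItems: number;    // eval_run_items
  responses: number;
  reviews: number;
}

// Figures reported for the AI-reviewed platform readiness gate.
const gateCounts: ReadinessCounts = {
  completedRuns: 60,
  prompts: 3000,
  runItems: 3000,
  responses: 3000,
  reviews: 3000,
};

// Each prompt should yield exactly one item, one response, and one review.
function countsAreConsistent(c: ReadinessCounts): boolean {
  return c.prompts === c.runItems && c.runItems === c.responses && c.responses === c.reviews;
}

// Average only. Individual runs may differ in size.
function averagePromptsPerRun(c: ReadinessCounts): number {
  return c.prompts / c.completedRuns;
}
```

A check like this speaks only to platform lifecycle integrity. It says nothing about the quality of Lucia's responses.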