Eval Labs is Lucia’s role-based human evaluation and platform-readiness infrastructure. It lets authorized users test, score, annotate, analyze, compare, oversee, and improve AI behavior over time.
Definition
Eval Labs is the internal product and workflow used to evaluate Lucia’s responses with role-based human judgment. It is not simply a prompt runner. It is first-class Lucia intelligence infrastructure: the place where Lucia’s behavior is tested, inspected, reviewed, analyzed, and improved.
Eval Labs captures:
- the prompt
- Lucia’s response
- the run source
- the suite context
- the human evaluator
- ratings
- suggested review signal
- quick employee review
- Human Guidance Evaluation
- pass/fail state
- written notes
- review lifecycle
- adjudication metadata
- role-gated access state
- Run History evidence
- Analysis and Single Run Analysis evidence
- Registry Diagnostics evidence
- Dataset Registry suggestion evidence
- Human Review Queue 2.0 lane suggestion evidence
- Behavioral Observatory labels
- exports for analysis
- dirty / completion state
- behavior patterns over time
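To make the captured fields concrete, here is a minimal sketch of how a subset of them could be modeled as one record. Every type, field, and function name below is hypothetical and chosen for illustration; this is not the actual Eval Labs schema. The completion rule is likewise an assumption. Only the three run source values come from this document.

```typescript
// Hypothetical record shape: names are illustrative, not the real Eval Labs schema.
type RunSource = "custom" | "automated" | "manual";

interface EvalRecord {
  prompt: string;
  response: string;                 // Lucia's response
  runSource: RunSource;
  suiteId: string | null;           // suite context, if the run belongs to a saved suite
  evaluatorId: string;              // the human evaluator
  ratings: Record<string, number>;  // rating dimension -> score (dimension names are assumed)
  passed: boolean | null;           // null until a reviewer sets pass/fail
  notes: string;                    // written notes
  dirty: boolean;                   // true while the review has unsaved edits
}

// Assumed completion rule: a pass/fail decision, at least one rating, and no unsaved edits.
function isReviewComplete(record: EvalRecord): boolean {
  const hasRating = Object.keys(record.ratings).length > 0;
  return record.passed !== null && hasRating && !record.dirty;
}

// A freshly captured run that has not been reviewed yet.
const draft: EvalRecord = {
  prompt: "example prompt",
  response: "example response",
  runSource: "custom",
  suiteId: null,
  evaluatorId: "evaluator-1",
  ratings: {},
  passed: null,
  notes: "",
  dirty: true,
};
```

Keeping pass/fail as `boolean | null` rather than a plain boolean is one way to distinguish "not yet reviewed" from "failed", which matters if dirty / completion state is tracked per item.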
Why Eval Labs exists
Generic AI benchmarks are not enough for Lucia. Lucia is not being built to win trivia tests. Lucia is being built to help hospitality operators stay oriented, make good decisions, and trust the system under real pressure. That means we must evaluate qualities that normal benchmarks miss.
The product principle
Eval Labs exists because Lucia cannot become excellent by vibes. Lucia needs repeated, inspectable, human-scored evaluation against real behavioral expectations. The product also needs controlled readiness evidence that proves the evaluation platform itself can create runs, capture responses, persist reviews, finalize sessions, hydrate Run History, hydrate Analysis, and keep local client state compact.
The system must help us answer:
- Did Lucia understand the user?
- Did Lucia choose the right mode?
- Did Lucia say too much or too little?
- Did Lucia create calm or noise?
- Did Lucia preserve trust?
- Did Lucia overclaim?
- Did Lucia give the right next move?
What Eval Labs does today
Eval Labs currently supports:
- custom 1–10 prompt tests
- saved custom prompt suites
- auto-generated prompt tests
- Guest Facing Agent Verification Check and Verification Results
- Controlled Batch Runner readiness checks
- shared Review Queue scoring
- suggested selections
- semantic scoring sliders
- Quick Review
- Human Guidance Evaluation
- lifecycle finalization
- Run History
- Team Review for owner/admin oversight
- Global Analysis
- Single Run Analysis
- Registry Diagnostics for derived Dataset Registry and Human Review Queue 2.0 inspection
- Behavioral Observatory for saved reviewer behavioral labels
- copy Session ID / copy Deep Link controls across key surfaces
- Clerk role gating for owner, admin, evaluator, and tester
- Supabase RLS protection for persisted evidence when the Clerk session carries eval_labs_role
- tester identity capture through Clerk
- JSON / CSV / Markdown exports
- run source tagging: custom, automated, manual
- Supabase persistence for suites, runs, items, and reviews
- live testing against the active dev Lucia Engine
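The role gating described above can be sketched as follows. This is an illustrative sketch, not the actual Eval Labs implementation: it assumes the session claims are available as a plain object and that `eval_labs_role` holds one of the four role names as a string. The function names are hypothetical. The only access rule taken from this document is that Team Review is for owner/admin oversight.

```typescript
// The four roles named in this document.
const EVAL_ROLES = ["owner", "admin", "evaluator", "tester"] as const;
type EvalRole = (typeof EVAL_ROLES)[number];

// Hypothetical helper: read the role from session claims.
// Assumes the claim key is "eval_labs_role" and its value is a role name.
function parseEvalRole(claims: Record<string, unknown>): EvalRole | null {
  const value = claims["eval_labs_role"];
  // Unknown, missing, or non-string values resolve to null (no Eval Labs role).
  return EVAL_ROLES.find((role) => role === value) ?? null;
}

// Team Review is described as an owner/admin oversight surface.
function canOpenTeamReview(role: EvalRole | null): boolean {
  return role === "owner" || role === "admin";
}
```

For example, `canOpenTeamReview(parseEvalRole({ eval_labs_role: "tester" }))` evaluates to `false`, and a session with no `eval_labs_role` claim is treated as having no role at all. A client-side check like this only controls what the UI shows; persisted evidence would still depend on the Supabase RLS protection listed above.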
What Eval Labs is not
Eval Labs is not:
Who uses Eval Labs
Eval Labs is designed for:
- founders
- owners/admins running platform and quality gates
- Lucia evaluators
- testers in the entry-level prompt-testing lane
- employees testing approved prompt workflows
- engineers validating behavior changes
- product leads reviewing quality patterns
- future QA and evaluation teams

