What Eval Labs Is - HelloLucia

Eval Labs is Lucia’s role-based human evaluation and platform-readiness infrastructure. It lets authorized users test, score, annotate, analyze, compare, oversee, and improve AI behavior over time.

Definition

Eval Labs is the internal product and workflow used to evaluate Lucia’s responses with role-based human judgment. It is not simply a prompt runner. It is first-class Lucia intelligence infrastructure: the place where Lucia’s behavior is tested, inspected, reviewed, analyzed, and improved. Eval Labs captures:

the prompt
Lucia’s response
the run source
the suite context
the human evaluator
ratings
suggested review signal
quick employee review
Human Guidance Evaluation
pass/fail state
written notes
review lifecycle
adjudication metadata
role-gated access state
Run History evidence
Analysis and Single Run Analysis evidence
Registry Diagnostics evidence
Dataset Registry suggestion evidence
Human Review Queue 2.0 lane suggestion evidence
Behavioral Observatory labels
exports for analysis
dirty / completion state
behavior patterns over time
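One way to picture a captured evaluation item is as a single typed record. The sketch below is illustrative only: every field name is hypothetical and mirrors a subset of the list above, not the real Eval Labs schema, and the sample values are invented for the example. The three run-source values are the ones Eval Labs tags: custom, automated, and manual.

```typescript
// Hypothetical shape for one captured evaluation item.
// Field names are illustrative assumptions, not the actual Eval Labs schema.
type RunSource = "custom" | "automated" | "manual";

interface EvalItemRecord {
  prompt: string;
  response: string;                 // Lucia's response
  runSource: RunSource;             // the run source
  suiteId: string | null;           // suite context (assumed nullable)
  evaluatorId: string;              // the human evaluator
  ratings: Record<string, number>;  // rating names and scale are assumptions
  passed: boolean | null;           // pass/fail state; null until a decision exists
  notes: string;                    // written notes
}

// In this sketch, an item counts as reviewed once a pass/fail decision exists.
function isReviewed(record: EvalItemRecord): boolean {
  return record.passed !== null;
}

// Invented sample data, for illustration only.
const exampleItem: EvalItemRecord = {
  prompt: "A guest says their room is not ready. What should I do?",
  response: "(Lucia's response text)",
  runSource: "custom",
  suiteId: null,
  evaluatorId: "evaluator-1",
  ratings: { calm: 8, trust: 7 },
  passed: null,
  notes: "",
};
```

Keeping pass/fail nullable separates "not yet reviewed" from "reviewed and failed", which is the kind of distinction the dirty / completion state in the list above implies.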

Why Eval Labs exists

Generic AI benchmarks are not enough for Lucia. Lucia is not being built to win trivia tests. Lucia is being built to help hospitality operators stay oriented, make good decisions, and trust the system under real pressure. That means we must evaluate qualities that normal benchmarks miss:

calm
warmth
trust
intent accuracy
operational usefulness
containment
truth-state discipline
operator cognitive load

The product principle

Eval Labs exists because Lucia cannot become excellent by vibes. Lucia needs repeated, inspectable, human-scored evaluation against real behavioral expectations. The product also needs controlled readiness evidence that proves the evaluation platform itself can create runs, capture responses, persist reviews, finalize sessions, hydrate Run History, hydrate Analysis, and keep local client state compact. The system must help us answer:

Did Lucia understand the user?
Did Lucia choose the right mode?
Did Lucia say too much or too little?
Did Lucia create calm or noise?
Did Lucia preserve trust?
Did Lucia overclaim?
Did Lucia give the right next move?

What Eval Labs does today

Eval Labs currently supports:

custom 1–10 prompt tests
saved custom prompt suites
auto-generated prompt tests
Guest Facing Agent Verification Check and Verification Results
Controlled Batch Runner readiness checks
shared Review Queue scoring
suggested selections
semantic scoring sliders
Quick Review
Human Guidance Evaluation
lifecycle finalization
Run History
Team Review for owner/admin oversight
Global Analysis
Single Run Analysis
Registry Diagnostics for derived Dataset Registry and Human Review Queue 2.0 inspection
Behavioral Observatory for saved reviewer behavioral labels
copy Session ID / copy Deep Link controls across key surfaces
Clerk role gating for owner, admin, evaluator, and tester
Supabase RLS protection for persisted evidence when the Clerk session carries eval_labs_role
tester identity capture through Clerk
JSON / CSV / Markdown exports
run source tagging: custom, automated, manual
Supabase persistence for suites, runs, items, and reviews
live testing against the active dev Lucia Engine
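The role gating and RLS protection listed above both depend on the session carrying an `eval_labs_role` value. As a minimal sketch, assuming the decoded session claims are available as a plain object with `eval_labs_role` as a top-level key (where the claim actually lives is not stated on this page), a fail-closed reader might look like this. `readEvalLabsRole` is a hypothetical helper, not a Clerk or Supabase API.

```typescript
type EvalLabsRole = "owner" | "admin" | "evaluator" | "tester";

const KNOWN_ROLES: readonly EvalLabsRole[] = ["owner", "admin", "evaluator", "tester"];

// Hypothetical helper: returns the role when the claim is present and
// recognized, and null otherwise, so callers can deny by default.
// Assumption: the claim is a top-level string key named "eval_labs_role".
function readEvalLabsRole(claims: Record<string, unknown>): EvalLabsRole | null {
  const value = claims["eval_labs_role"];
  if (typeof value !== "string") return null;
  const match = KNOWN_ROLES.find((role) => role === value);
  return match ?? null;
}
```

Returning `null` for a missing or unrecognized claim keeps the check fail-closed: a session without a valid role gets no Eval Labs access rather than a default role.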

What Eval Labs is not

Eval Labs is not:

a generic chat app
a one-off prompt playground
a rubber-stamp review form
a dataset label factory
a vague behavior dashboard
a replacement for product judgment
a replacement for Lucia doctrine
a claim that Lucia is human-approved
a complete backend authorization boundary by itself

Eval Labs is:

Lucia-native behavioral judgment infrastructure

Who uses Eval Labs

Eval Labs is designed for:

founders
owners/admins running platform and quality gates
Lucia evaluators
testers in the entry-level prompt-testing lane
employees testing approved prompt workflows
engineers validating behavior changes
product leads reviewing quality patterns
future QA and evaluation teams

Owners and admins have full platform access, shared persisted evidence, Team Review, Global Analysis, and all current test surfaces. Evaluator is the full evaluator workbench role: evaluators can use evaluator-safe test types and their own run, review, and history routes, but they do not see Team Review or Global Analysis. Tester is the narrower onboarding role: testers can use Custom Prompt Test and Auto-generated Prompt Test, but they cannot use verification, controlled batch, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools. For the canonical matrix, see Eval Labs Roles and Access Matrix.
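The access rules in this section can be sketched as a small lookup. This is an illustration under stated assumptions, not the canonical matrix: the surface identifiers are hypothetical names, and any cell this page does not state (for example, exactly which test types count as evaluator-safe) is marked `"unspecified"` rather than guessed.

```typescript
type Role = "owner" | "admin" | "evaluator" | "tester";

// Hypothetical identifiers for the surfaces named in this section.
type Surface =
  | "customPromptTest"
  | "autoGeneratedPromptTest"
  | "verification"
  | "controlledBatch"
  | "teamReview"
  | "globalAnalysis"
  | "registryDiagnostics"
  | "behavioralObservatory";

// "unspecified" marks cells this page does not state; consult the canonical matrix.
type Access = "allow" | "deny" | "unspecified";

function accessFor(role: Role, surface: Surface): Access {
  // Owners and admins have full platform access.
  if (role === "owner" || role === "admin") return "allow";

  // Testers get the two prompt-test surfaces and nothing else listed here.
  if (role === "tester") {
    return surface === "customPromptTest" || surface === "autoGeneratedPromptTest"
      ? "allow"
      : "deny";
  }

  // Evaluators: this page only states that Team Review and Global Analysis are hidden.
  if (surface === "teamReview" || surface === "globalAnalysis") return "deny";
  return "unspecified";
}

// Fail closed: only an explicit "allow" grants access.
function canAccess(role: Role, surface: Surface): boolean {
  return accessFor(role, surface) === "allow";
}
```

Treating `"unspecified"` as no access in `canAccess` is a deliberate design choice for the sketch: a gap in the documented rules should never widen access by accident.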

Why saved suites matter

Saved custom prompt suites changed Eval Labs from a testing slot machine into a regression lab. Before saved suites, reviewers could run broad generated tests. Now reviewers can repeatedly test the same exact prompts while Lucia’s brain, intent layer, wording, memory, and routing evolve. That repeatability is the foundation of serious improvement.

Current expanded definition

Eval Labs now captures multiple layers of review signal:

ratings
suggested review
quick employee review
human guidance evaluation
derived dataset membership suggestions
derived Human Review Queue 2.0 lane suggestions
persisted Behavioral Observatory labels
optional notes
review state
senior review routing
canon candidate signal
adjudication metadata
review lifecycle
dirty / completion state
exports for analysis

The core architectural distinctions are:

employee reaction ≠ canonical training label
derived suggestion ≠ saved label

Employees should provide structured reactions. Senior reviewers and adjudicators should assign canonical meaning. This protects Lucia from ontology drift while still allowing non-expert employees to participate in useful review work.

Registry Diagnostics belongs to the derived-suggestion layer. It helps the team inspect how existing Eval Labs evidence appears to map to datasets and queue lanes. Behavioral Observatory belongs to the saved-label layer. It preserves intentional reviewer labels for intent, guest affect, response strategy, humanness, and notes when persistence succeeds.
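The separation between these layers can be made explicit in types. The following is a minimal sketch, assuming hypothetical type names, field names, and reviewer tiers; it shows only the principle that an employee reaction never becomes a canonical label without a senior reviewer or adjudicator assigning it.

```typescript
// Distinct evidence kinds; the `kind` tag keeps them from being confused.
// All names here are illustrative assumptions, not the Eval Labs schema.
interface EmployeeReaction {
  kind: "employee_reaction";
  itemId: string;
  reaction: string;
}

interface DerivedSuggestion {
  kind: "derived_suggestion";
  itemId: string;
  suggestedDataset: string;
}

interface CanonicalLabel {
  kind: "canonical_label";
  itemId: string;
  label: string;
  assignedBy: string;
}

type Evidence = EmployeeReaction | DerivedSuggestion | CanonicalLabel;

// Hypothetical reviewer tiers.
type ReviewerTier = "employee" | "senior_reviewer" | "adjudicator";

interface Reviewer {
  id: string;
  tier: ReviewerTier;
}

// Only senior reviewers and adjudicators may turn a reaction into a canonical label.
function assignCanonicalLabel(
  reaction: EmployeeReaction,
  label: string,
  reviewer: Reviewer
): CanonicalLabel {
  if (reviewer.tier === "employee") {
    throw new Error("Only senior reviewers and adjudicators assign canonical meaning");
  }
  return { kind: "canonical_label", itemId: reaction.itemId, label, assignedBy: reviewer.id };
}

// Reactions and derived suggestions are never treated as canonical.
function isCanonicalLabel(evidence: Evidence): evidence is CanonicalLabel {
  return evidence.kind === "canonical_label";
}
```

Because each kind carries its own tag, code that consumes canonical labels cannot silently accept a reaction or a derived suggestion in their place.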

Readiness doctrine

Eval Labs has now passed the AI-reviewed platform readiness gate:

60 completed runs
3,000 prompts
3,000 eval_run_items
3,000 Lucia responses
3,000 reviews

That result proves the platform lifecycle can run end to end. It does not prove Lucia is ready for real operators. Human review remains the true Lucia behavioral-quality judgment layer. The current platform is implemented for controlled role-based human onboarding. Evaluator workspace polish and first-cohort guidance remain active hardening work.
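The gate figures are internally consistent: every prompt has a matching run item, response, and review, which works out to an average of 50 items per run (whether each run held exactly 50 is not stated). A parity check along these lines is sketched below. It is an illustration with hypothetical names, not the gate's actual criteria.

```typescript
interface ReadinessCounts {
  completedRuns: number;
  prompts: number;
  evalRunItems: number;
  responses: number;
  reviews: number;
}

// True when every prompt has a matching run item, response, and review.
function lifecycleCountsAgree(counts: ReadinessCounts): boolean {
  return (
    counts.completedRuns > 0 &&
    counts.prompts === counts.evalRunItems &&
    counts.evalRunItems === counts.responses &&
    counts.responses === counts.reviews
  );
}

// Figures reported for the AI-reviewed platform readiness gate.
const gateCounts: ReadinessCounts = {
  completedRuns: 60,
  prompts: 3000,
  evalRunItems: 3000,
  responses: 3000,
  reviews: 3000,
};

// 3000 items across 60 runs is an average of 50 items per run.
const averageItemsPerRun = gateCounts.evalRunItems / gateCounts.completedRuns;
```

A check like this speaks only to platform lifecycle integrity. It says nothing about whether any individual response was good, which is the job of human review.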
