Eval Labs Platform - HelloLucia

Eval Labs is the dedicated platform for testing Lucia’s responses, capturing review judgments, and hardening behavior over time.

Product Role

Eval Labs is a first-class product surface. EvaluationLabs.ai is Lucia’s proprietary evaluation platform for shaping her human intent layer, emotional awareness, psychological understanding, natural language interpretation, warmth, empathy, judgment, and operational intelligence. It supports:

prompt testing
response capture
human review
behavioral notes
quality scoring
future regression checks
guest identity/linkage regression
privacy leakage checks
payment truth regression checks
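
As an illustration of how prompt testing, response capture, human review, behavioral notes, and quality scoring might sit together in one record, here is a minimal Python sketch. The EvalRecord class, its field names, and the 1 to 5 score scale are assumptions made for this example; they are not Eval Labs' actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalRecord:
    """Hypothetical record: one tested prompt, the captured response, and its review."""
    prompt: str
    response: str
    reviewer: Optional[str] = None
    passed: Optional[bool] = None   # None until a human has judged the response
    score: Optional[int] = None     # assumed 1-5 quality scale
    notes: List[str] = field(default_factory=list)  # behavioral notes

    @property
    def reviewed(self) -> bool:
        """True once a human judgment has been attached."""
        return self.passed is not None

    def review(self, reviewer: str, passed: bool, score: int, note: str = "") -> None:
        """Attach a human judgment; reject scores outside the assumed scale."""
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self.reviewer = reviewer
        self.passed = passed
        self.score = score
        if note:
            self.notes.append(note)
```

Keeping the judgment (passed) separate from the score lets a reviewer fail a response that reads well but is not useful or true.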

Why It Matters

Lucia’s intelligence cannot be trusted on vibes alone. It needs a system that can answer:

Did this response help?
Did it preserve calm?
Did it tell the truth?
Would an operator keep using Lucia?

Review Philosophy

Eval Labs should not reward polished nonsense. It should reward:

usefulness
clarity
restraint
truth
operator relief

Current Relationship to Lucia

Eval Labs is currently used to test Lucia’s operator-facing behavior, especially Focus Ops. Over time it should support:

intent regression testing
tone regression testing
response contract testing
launch readiness checks
model upgrade comparisons

Current validation focus includes:

v0.1.3.6 Focus Ops behavior
semantic conversational intent assist
scoped entity routing
Calendar/booking-spine awareness
Workspace OS context awareness
Signal Stream active_context
active_context.workspace surface awareness
prior recommendation memory
prior offer context
saved DAW workflow truth-state
the Signal → Action → Save → Reminder loop
Resolver Matrix route correctness
Dynamic Action Workspace render correctness
Full Booking Page route correctness
GPT-5.5 default model behavior
Lucia JSON gateway behavior through the OpenAI Responses API

Guest-facing Lucia adds a second required validation surface:

guest identity orientation
claim-strength handling
magic-link verification eligibility
token consume and verified session state
privacy-safe verification email behavior
unlinked/candidate/verified guest signal routing
Admin Signal Stream guest signal visibility
Focus Ops no-drift behavior for candidate/unlinked signals
hospitality warmth and public-facing tone
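
The token consume and verified session state item above can be made concrete with a small test double. The sketch below is an assumption about how a one-time magic-link token could behave: the first unexpired use verifies the session, while a replayed or expired token does not. MagicLinkToken, GuestSession, consume_token, and the outcome strings are hypothetical names, not Lucia's implementation.

```python
from dataclasses import dataclass


@dataclass
class MagicLinkToken:
    """Hypothetical one-time token; expires_at is in epoch seconds."""
    value: str
    expires_at: float
    consumed: bool = False


@dataclass
class GuestSession:
    """Hypothetical guest session; unverified until a token is consumed."""
    verified: bool = False


def consume_token(token: MagicLinkToken, session: GuestSession, now: float) -> str:
    """Verify the session only on the first, unexpired use of the token."""
    if token.consumed:
        return "replayed"      # second use must never verify anything
    if now >= token.expires_at:
        return "expired"       # late use leaves the session unverified
    token.consumed = True
    session.verified = True
    return "verified"
```

An eval case built on this shape would assert three things: a fresh token verifies, a second use of the same token is rejected, and an expired token leaves the session unverified.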

This surface should be formalized as a first-class Guest-Facing Lucia Eval Track, separate from operator-facing Lucia evals.

Purpose:

Manual testing cannot cover every possible guest phrasing.
Eval Labs must stress-test guest identity, verification, privacy, tone, and guest-to-operator routing at scale.
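
One way to approach phrasing coverage at scale is to expand a few seed phrasings into many labeled cases. The sketch below uses the four guest orientation paths this track covers (already booked, joining someone, planning, exploring). The seed phrasings, openers, closers, and the generate_cases function are illustrative assumptions, not an Eval Labs dataset.

```python
from itertools import product
from typing import Dict, List

# Hypothetical seed phrasings, two per orientation path.
SEED_PHRASINGS: Dict[str, List[str]] = {
    "already_booked": ["I already have a booking", "we booked last month"],
    "joining_someone": ["I'm joining my sister's trip", "my friend booked and I'm coming along"],
    "planning": ["I'm planning a trip for August", "we want to book something soon"],
    "exploring": ["just looking around", "what do you offer"],
}
OPENERS = ["", "Hi, ", "hey "]
CLOSERS = ["", ".", " thanks"]


def generate_cases() -> List[Dict[str, str]]:
    """Expand every seed phrasing with every opener/closer pair, keeping its label."""
    cases = []
    for path, phrasings in SEED_PHRASINGS.items():
        for opener, text, closer in product(OPENERS, phrasings, CLOSERS):
            cases.append({"expected_path": path, "prompt": f"{opener}{text}{closer}"})
    return cases
```

Each generated case carries its expected orientation path, so a run can be scored automatically and only the mismatches need human review.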

The track must cover:

identity orientation
orientation paths for already booked / joining someone / planning / exploring
booked-guest claim parsing
claim fragments across turns
weak vs strong claims
ambiguous/no-match cases
magic-link eligibility
email sent only to the booking email on file
no private data leakage
token consume/replay/expiry
verified session state
guest-to-operator signal creation
Admin Signal Stream visibility
Focus Ops no Luca/Nora drift
warm hospitality tone

Payment truth adds a separate required financial-attention validation lane:

confirmed-paid suppression
pending-review restraint
failed/disputed/refunded not treated as paid
policy-unknown no due/overdue overclaim
deposit paid but final balance not yet due
final balance overdue only after policy + temporal truth support it
Admin rendering remains read-only / Engine source / no writes
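
The payment truth rules above can be read as an ordered decision. The following Python sketch is one possible encoding, written under assumptions: the status strings, the PaymentFacts fields, and the returned labels are hypothetical, and a missing final_due date stands in for "policy unknown". It is a pure function with no writes, in keeping with the read-only rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Statuses that must never be treated as paid.
NOT_PAID_STATUSES = {"failed", "disputed", "refunded"}


@dataclass(frozen=True)
class PaymentFacts:
    """Hypothetical read-only view of payment truth for one booking."""
    status: str                       # e.g. "confirmed_paid", "pending_review", "deposit_paid"
    final_due: Optional[date] = None  # None means the payment policy is unknown


def financial_attention(facts: PaymentFacts, today: date) -> str:
    """Return a hypothetical attention label without overclaiming due or overdue."""
    if facts.status == "confirmed_paid":
        return "none"                   # confirmed-paid suppression
    if facts.status == "pending_review":
        return "hold"                   # pending-review restraint
    if facts.status in NOT_PAID_STATUSES:
        return "not_paid"               # never reported as paid
    if facts.final_due is None:
        return "policy_unknown"         # no due/overdue claim without a policy
    if today > facts.final_due:
        return "final_balance_overdue"  # policy and date both support it
    return "not_yet_due"                # e.g. deposit paid, balance not yet due
```

The order of the checks matters: the overdue label is only reachable once both a known policy date and the current date support it.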

This is future Eval Labs coverage. Current Harper Quinn proof is runtime Development proof, not broad Eval Labs certification.

v0.1.3.6 Dev Baseline Workflow

Eval Labs should validate v0.1.3.6 against:

https://api-dev.hellolucia.ai/admin/operator-focus

The canonical Focus Ops route is:

/admin/operator-focus

Strict brain quality eval reached 178/178 after workspace-context awareness. This is evidence to capture and review, not a permanent guarantee. v0.1.3.6 is not promoted to staging yet. Staging promotion waits until the Eval Labs dev baseline is captured and reviewed.

Current live-dev build identity under review:

The live-dev topbar displays Admin and Engine build identity.
Admin identity is injected at Admin build time.
Engine identity is fetched from the Engine root endpoint.
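
The promotion rule in this section (staging waits until the dev baseline is captured and reviewed) can be sketched as a simple gate. The DevBaseline class, its field names, and the placeholder build identity strings below are assumptions for illustration; real build identities would come from the live-dev topbar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DevBaseline:
    """Hypothetical snapshot of one dev baseline run."""
    version: str
    passed: int
    total: int
    admin_build: str        # Admin identity, injected at Admin build time
    engine_build: str       # Engine identity, fetched from the Engine root endpoint
    captured: bool = False
    reviewed: bool = False


def ready_for_staging(baseline: DevBaseline) -> bool:
    """Promotion waits on capture and review; the score alone does not unlock it."""
    return baseline.captured and baseline.reviewed
```

In this sketch a perfect score with no capture or review still returns False, which mirrors the point that the result is evidence to review rather than a guarantee.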

Critical Review Question

The central question is:

Would a real operator keep talking to Lucia after this response?

If not, the response fails even if it is technically correct.

For guest-facing Lucia, the parallel question is:

Would a real guest feel helped without the system leaking or inventing booking truth?

If not, the response fails even if it sounds friendly.