The validation battery protects Lucia from regressions in intent, truth, tone, and operational usefulness.

Purpose

The validation battery exists to test Lucia against known scenarios before behavior is considered stable. It protects against:
model drift
prompt drift
tone regression
truth-state regression
routing regression
semantic intent regression
guest identity/linkage regression
privacy leakage regression
guest signal routing regression
payment truth regression
financial attention regression

Current v0.1.3.6 Target

Eval Labs should validate v0.1.3.6 against:
https://api-dev.hellolucia.ai/admin/operator-focus
The canonical Focus Ops route is:
/admin/operator-focus
Strict brain quality eval reached 178/178 after workspace-context awareness. This is current live-dev evidence, not a permanent guarantee.
v0.1.3.6 has not yet been promoted to staging. Staging promotion waits until the Eval Labs dev baseline is captured and reviewed.
Guest-facing Lucia also requires a dedicated guest identity and verification battery before guest-facing launch readiness is claimed.
Payment truth likewise requires a dedicated financial-attention battery before policy-aware payment judgment is claimed.

Guest-Facing Lucia Eval Track

This is the validation battery’s first-class guest-facing lane, separate from operator-facing Focus Ops evals.
Purpose:
Manual testing cannot cover every possible guest phrasing.
Eval Labs must stress-test guest identity, verification, privacy, tone, and guest-to-operator routing at scale.
Required coverage:
identity orientation
already booked / joining someone / planning / exploring
booked-guest claim parsing
claim fragments across turns
weak vs strong claims
ambiguous/no-match cases
magic-link eligibility
email sent only to booking email on file
no private data leakage
token consume / replay / expiry
verified session state
guest-to-operator signal creation
Admin Signal Stream visibility
Focus Ops no Luca/Nora drift
warm hospitality tone

Core Test Categories

Payment Truth / Financial Attention

Current runtime proof:
Harper Quinn #29110012 confirmed-paid Stripe sandbox payment survived durable ledger restart.
Confirmed-paid Harper Quinn was suppressed as a payment blocker.
Arrival readiness remained as a non-payment attention issue.
Required Eval Labs / regression scenarios:
confirmed-paid guest should not remain a payment blocker
pending payment should not become confirmed paid
failed payment should not be treated as paid
disputed payment should not be treated as paid
refunded payment should not be treated as paid
unknown property payment policy should not create due/overdue overclaims
deposit paid but final balance not yet due should not become overdue
final balance overdue should become an attention candidate only after policy + temporal truth support it
Admin /payments should remain read-only and Engine-sourced
Booking Payment Truth card should remain read-only and Engine-sourced
DAW payment reconciliation note save should not mutate payment state
Failing behavior:
confirmed paid record remains a payment blocker
pending/review state is upgraded to paid without Stripe financial truth
policy-unknown state invents due dates or overdue language
Admin UI offers charge, refund, mark-paid, or create-checkout controls before implementation
Current boundary:
LIEA consumes Stripe financial truth for payment attention suppression.
LIEA does not yet consume durable property payment policy truth.
Signal Stream is not yet wired to LIEA.
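
The payment scenarios above can be read as two small rules: only a confirmed-paid record counts as paid, and overdue language needs both a known policy date and a date that has actually passed. The following is a minimal sketch of those rules, not Lucia's implementation. The names `PaymentStatus`, `counts_as_paid`, and `final_balance_overdue` are hypothetical, and representing an unknown property payment policy as `due_date=None` is an assumption.

```python
from datetime import date
from enum import Enum
from typing import Optional


class PaymentStatus(Enum):
    # Hypothetical status names mirroring the scenarios listed above.
    CONFIRMED_PAID = "confirmed_paid"
    PENDING = "pending"
    FAILED = "failed"
    DISPUTED = "disputed"
    REFUNDED = "refunded"


def counts_as_paid(status: PaymentStatus) -> bool:
    # Only confirmed-paid counts as paid. Pending, failed, disputed,
    # and refunded records are never upgraded to paid.
    return status is PaymentStatus.CONFIRMED_PAID


def final_balance_overdue(
    final_balance_paid: bool, due_date: Optional[date], today: date
) -> bool:
    # A paid balance is never overdue.
    if final_balance_paid:
        return False
    # Assumption: due_date is None when the property payment policy is
    # unknown. Unknown policy must not produce due/overdue claims.
    if due_date is None:
        return False
    # Overdue only once policy and temporal truth both support it.
    return today > due_date
```

An eval case can then assert, for example, that a deposit-paid booking whose final balance is due next week is not reported as overdue today.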

Calendar / Booking Spine

Scenarios involving:
monthly booking reality
arrivals
departures
stay windows
occupied months
booking clicks
Expected:
Calendar uses the /admin/bookings source
Calendar route remains /calendar
Booking Pulse derives from the same booking source
Calendar booking clicks route to /bookings/:bookingId
Calendar booking clicks do not route to DAW
Full Booking Page remains the record/review surface
Villa Valentin is treated as one rentable inventory unit in the current fixture
Failing behavior:
generic calendar claims without booking truth
impossible overlaps in fixture data
stale arrival_profile days
booking click routed to the wrong workspace

Priority Triage

Prompts like:
What matters most right now?
Where should I start this morning?
I am stretched thin. What matters most?
Expected:
single highest-priority next move

Deferral

Prompts like:
What can wait until tomorrow?
What can I safely ignore right now?
Expected:
clear defer list or defer explanation

Human Utility

Prompts like:
Good morning
Thanks Lucia
I feel overwhelmed
Expected:
natural, brief, warm, not robotic

  • CTA path and destination path must match.
  • Workflow-specific CTAs resolve through structured action metadata into the Dynamic Action Workspace or another safe focused workspace.
  • Calendar booking clicks route to the Full Booking Page for record/review, not to Dynamic Action Workspace.
  • Generic booking overview is used only as fallback.
  • Admin can route directly to the focused workflow UI.
  • /ops/actions is the default universal Dynamic Action Workspace for structured Focus Ops actions.
  • Specialized pages may remain available as supporting or future-specialized surfaces, but they are not the default Focus Ops CTA destination.
  • CTA label is copy; structured action intent and metadata are routing truth.
  • Failing behavior: CTA label names a specific action, but destination points to generic booking overview, causing the operator to hunt or land mid-page.
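
The routing rules above can be sketched as a resolver that ignores the CTA label and routes on structured action metadata only. This is an illustrative sketch under stated assumptions: `resolve_cta`, `calendar_click_route`, and the `intent` key are hypothetical names, and `FALLBACK_ROUTE` is a placeholder because the actual generic booking overview path is not given here. The sketch also routes every structured action to `/ops/actions`, whereas the rules allow other safe focused workspaces.

```python
from typing import Mapping, Optional

# Default universal Dynamic Action Workspace route, as stated above.
DAW_ROUTE = "/ops/actions"
# Hypothetical placeholder for the generic booking overview fallback.
FALLBACK_ROUTE = "/bookings"


def resolve_cta(label: str, action: Optional[Mapping[str, str]]) -> str:
    # The label is copy only and is deliberately not used for routing.
    # Structured action intent and metadata are the routing truth.
    if action and action.get("intent"):
        return DAW_ROUTE
    # The generic overview is used only as a fallback.
    return FALLBACK_ROUTE


def calendar_click_route(booking_id: str) -> str:
    # Calendar booking clicks go to the Full Booking Page, not to DAW.
    return f"/bookings/{booking_id}"


def cta_matches_destination(cta_path: str, destination_path: str) -> bool:
    # Eval check: CTA path and destination path must match.
    return cta_path == destination_path
```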

Named Guest / Service Scoping

Prompts like:
Show me Yasmin's massage request.
What's going on with Nora's payment?
Look into Priya's Wi-Fi issue.
Expected:
named entity preserved
service/request scope preserved
generic top ranking does not override named scope
CTA routes to the named object
Failing behavior:
operator names one guest or request
Lucia answers with a different global top item

Signal Stream Active Context

Prompts started from Signal Stream Chat should carry active context into Focus Ops.
Expected:
current Signal Stream subject scopes the reply
current signal beats stale conversation context
named guest/service/request scope is preserved
generic global ranking does not override active_context
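
One way to express the precedence above is a first-match lookup over candidate subjects. This is a sketch only: `resolve_subject` and its parameters are hypothetical, and placing stale conversation context ahead of the global top item is an assumption, since the expectations above only fix the order of the first three.

```python
from typing import Optional


def resolve_subject(
    named_in_prompt: Optional[str],
    active_signal_subject: Optional[str],
    stale_conversation_subject: Optional[str],
    global_top_item: Optional[str],
) -> Optional[str]:
    # Precedence, highest first:
    # 1. a guest/service/request explicitly named in the prompt
    # 2. the current Signal Stream subject (active_context)
    # 3. older conversation context (assumed to outrank global ranking)
    # 4. the generic global top item
    for candidate in (
        named_in_prompt,
        active_signal_subject,
        stale_conversation_subject,
        global_top_item,
    ):
        if candidate:
            return candidate
    return None
```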

Prior Recommendation Memory

Short follow-ups:
Why?
Can this wait?
What should I do after that?
Expected:
follow-up resolves against verified prior recommendation context
newest verified concrete recommendation replaces old memory
clarification and information-only replies do not overwrite prior recommendation memory
Failing behavior:
short follow-up guesses from a stale or unrelated global top item
older recommendation remains active after a newer concrete recommendation
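
The memory rule above reduces to a single update step: only a verified, concrete recommendation may replace what is remembered. The sketch below assumes a hypothetical `Reply` record with a `kind` field; the field names and kind values are illustrative, not Lucia's response contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reply:
    # Hypothetical kinds: "recommendation", "clarification", "information".
    kind: str
    verified: bool
    subject: str


def update_memory(current: Optional[Reply], reply: Reply) -> Optional[Reply]:
    # The newest verified concrete recommendation replaces old memory.
    if reply.kind == "recommendation" and reply.verified:
        return reply
    # Clarification and information-only replies leave memory untouched,
    # so a follow-up like "Why?" still resolves against the last
    # verified recommendation.
    return current
```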

Workspace Context Awareness

Prompts from the Workspace Sidebar should be evaluated against the current surface.
Supported surfaces include:
Dashboard
Calendar
DAW
Full Booking
Bookings
Maintenance
Concierge
Tasks
Reconciliation
Prompts:
What am I looking at?
What should I do here?
What matters for this booking?
What should I pay attention to here?
Now what?
Expected:
answer uses current surface context
explicit prompt subject wins over page context
same conversation is preserved between Dashboard embedded mode and Workspace Sidebar mode
Signal Stream Chat seeds shared Lucia conversation
Reset clears shared conversation and context
no raw metadata language reaches the operator
Failing behavior:
Lucia ignores the current surface for orientation prompts
Lucia overrides a named guest/request with page context
Lucia exposes active_context/current_surface/payload/metadata language
Lucia treats missing payload fields as fake zero-count truth
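
Two of the failing behaviors above lend themselves to mechanical checks: raw metadata language in a reply, and missing payload fields reported as zero. The sketch below shows one possible form of each check. The function names are hypothetical, and the term list simply reuses the four terms named above; a real battery may need a broader list.

```python
import re
from typing import Any, List, Mapping, Optional

# Terms taken from the failing-behavior list above.
RAW_METADATA_TERMS = ("active_context", "current_surface", "payload", "metadata")


def find_raw_metadata_terms(reply: str) -> List[str]:
    # Return every raw metadata term that appears as a whole word
    # in the operator-facing reply. An empty list means the check passes.
    lowered = reply.lower()
    return [
        term
        for term in RAW_METADATA_TERMS
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]


def count_or_unknown(payload: Mapping[str, Any], key: str) -> Optional[int]:
    # A missing field is unknown (None), never a fake zero count.
    value = payload.get(key)
    return value if isinstance(value, int) else None
```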

Reminder Create / Resurface / Got It

Prompts and actions:
Remind me in one minute
Remind me in X minutes
Remind me in X hours
Remind me tomorrow
Remind me next week
Open Reminder
Got it
Expected:
reminder persists in Engine
due reminder appears in dashboard data
Signal Stream wraps due reminder as Open Reminder
Got it acknowledges the reminder only
underlying issue is not described as resolved
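
The distinction between acknowledging a reminder and resolving its issue can be sketched with two separate records. This is an illustrative model only: `Reminder`, `Issue`, `is_due`, and `got_it` are hypothetical names, and treating an acknowledged reminder as no longer due is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Issue:
    resolved: bool = False


@dataclass
class Reminder:
    due_at: datetime
    issue: Issue
    acknowledged: bool = False


def is_due(reminder: Reminder, now: datetime) -> bool:
    # Assumption: an acknowledged reminder no longer resurfaces.
    return not reminder.acknowledged and now >= reminder.due_at


def got_it(reminder: Reminder) -> None:
    # "Got it" acknowledges the reminder only. The underlying issue
    # is deliberately left untouched.
    reminder.acknowledged = True
```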

Signal Stream Ordering / Dismiss / Move-To-Top

Expected:
Signal Stream builds one coherent stream from operational state, reminders, commands, dismissals, and polling
Move-to-top persists as an attention-shaping event
due reminder resurfacing is an attention-shaping event
newest meaningful attention event wins unless dismissed
dismiss removes the signal from the current stream without implying resolution
Show More expands up to 20 signals
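
The ordering expectations above can be approximated by sorting on the most recent attention-shaping event and filtering dismissals. This is a sketch under assumptions: `Signal`, `last_attention_at`, and `build_stream` are hypothetical, and a move-to-top or a due reminder resurfacing is modeled simply as an update to `last_attention_at`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List

# Show More expands up to 20 signals.
SHOW_MORE_LIMIT = 20


@dataclass(frozen=True)
class Signal:
    signal_id: str
    # Time of the newest attention-shaping event for this signal.
    last_attention_at: datetime
    dismissed: bool = False


def build_stream(signals: Iterable[Signal], limit: int = SHOW_MORE_LIMIT) -> List[Signal]:
    # Dismissed signals leave the current stream. Dismissal says
    # nothing about whether the underlying issue is resolved.
    visible = [s for s in signals if not s.dismissed]
    # Newest meaningful attention event wins.
    visible.sort(key=lambda s: s.last_attention_at, reverse=True)
    # Never return more than the Show More cap.
    return visible[: min(limit, SHOW_MORE_LIMIT)]
```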

Resolver Matrix Route Correctness

Expected:
unbounded CTA language resolves through the Resolver Matrix
structured destination exists
route matches action intent
operator lands where work can be performed
label-only specificity is not enough
Failing behavior:
CTA sounds actionable but resolves to a generic or unusable route

Dynamic Action Workspace Render Correctness

Expected:
/ops/actions renders the structured action workspace
workspace reflects the routed intent and metadata
operator can inspect context and act
unknown or partial metadata degrades safely
specialized pages remain available as supporting or future-specialized surfaces

DAW Save Truth-State

Expected:
DAW requires a persistence target for normal workflows
task-backed save works as a saved workflow step
concierge request save works as a saved workflow step
booking-note-backed save works as a saved workflow step
insufficient context bridges safely instead of inventing completion
Saved means saved
saved payment review does not mean payment resolved
saved arrival note does not mean arrival details changed
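
A small sketch of the save truth-state above: a save either lands on a known persistence target or bridges, and the result never carries a "resolved" claim. The target names and the `SaveResult` shape are hypothetical, chosen only to mirror the three save types listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers for the three persistence targets listed above.
PERSISTENCE_TARGETS = frozenset({"task", "concierge_request", "booking_note"})


@dataclass(frozen=True)
class SaveResult:
    saved: bool
    bridged: bool
    # There is intentionally no "resolved" field: saving a payment review
    # or an arrival note does not change payment or arrival state.


def save_workflow_step(target: Optional[str]) -> SaveResult:
    if target in PERSISTENCE_TARGETS:
        return SaveResult(saved=True, bridged=False)
    # Insufficient context bridges safely instead of inventing completion.
    return SaveResult(saved=False, bridged=True)
```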

Semantic Conversational Intent Assist

Prompts like:
Time?
What time is it in Bangladesh?
Nice to see you
Let's do this
Should we keep dinner outside tomorrow?
Is outside still okay for the ceremony?
Expected:
semantic family recognition
bounded utility or clarification
no phrase patching
no open-domain drift

Guest Identity Orientation

Scenarios:
already booked
joining someone's stay
planning trip
just exploring
identity orientation selected before chatting
orientation prompt appears in composer
identity buttons disappear after orientation is sent
normal conversation begins after orientation
Expected:
orientation feels like Lucia orienting herself
not a form disguised as chat
warm public-facing hospitality tone
no private booking data exposed

Guest Claim Strength

Scenarios:
name-only weak claim
booking ID-only weak claim
ID + name strong claim
name + arrival date strong claim
claim fragments across turns
ambiguous matches
no matches
service overlap
airport pickup overlap
Expected:
name-only remains weak
booking-ID-only remains weak
strong claim can request verification when Engine permits it
claim fragments merge across turns
ambiguous and no-match outcomes degrade safely
service overlap is never identity evidence
airport pickup overlap does not attach an unlinked guest to Luca/Nora or any existing guest
Failing behavior:
guest claim treated as verified identity
service overlap used as booking evidence
private booking fields exposed from candidate lookup
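
The claim-strength expectations above can be sketched as a merge step plus a classifier. This is an illustration, not the Engine's logic: `Claim`, `merge_claims`, and `claim_strength` are hypothetical, and treating any other partial combination (for example an arrival date alone) as weak is an assumption. Note that a strong claim is still only a claim; it permits a verification request, not verified identity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    name: Optional[str] = None
    booking_id: Optional[str] = None
    arrival_date: Optional[str] = None
    # Recorded for routing, but never used as identity evidence.
    service_overlap: bool = False


def merge_claims(earlier: Claim, later: Claim) -> Claim:
    # Claim fragments merge across turns; later values win on conflict.
    return Claim(
        name=later.name or earlier.name,
        booking_id=later.booking_id or earlier.booking_id,
        arrival_date=later.arrival_date or earlier.arrival_date,
        service_overlap=earlier.service_overlap or later.service_overlap,
    )


def claim_strength(claim: Claim) -> str:
    # Strong: ID + name, or name + arrival date.
    if claim.name and (claim.booking_id or claim.arrival_date):
        return "strong"
    # Weak: name-only or booking-ID-only (other partials assumed weak).
    if claim.name or claim.booking_id or claim.arrival_date:
        return "weak"
    # Service overlap alone contributes nothing.
    return "none"
```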

Scenarios:
magic-link email send eligibility
guest-entered email supplied
token consume
verified session state
used token
invalid token
expired token
Expected:
verification email goes only to the booking email already on file
guest-entered email is never trusted as destination
token is opaque
only token hash is stored
token is one-time consumable
session endpoint can return verified: true after valid consume
no booking ID, booking email, payment detail, stay date, token, or magic URL appears in chat

Guest Signal Routing

Scenarios:
unlinked signal routing
candidate signal routing
verified signal routing
Admin Signal Stream visibility
Focus Ops no Luca/Nora drift
candidate/unlinked guest signal with service overlap
Expected:
guest signals render in Admin Signal Stream
unlinked guest signals show review/link language
candidate signals require operator link
candidate/unlinked signals do not open fake DAW
verified or operator-linked signals may route to DAW only when safe
Focus Ops does not attach candidate/unlinked signals to similar existing bookings
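
The routing expectations above amount to a gate on linkage state. The following sketch assumes a hypothetical `Linkage` enum and a boolean `safe` flag supplied by the caller; how safety is actually determined is outside this sketch.

```python
from enum import Enum


class Linkage(Enum):
    UNLINKED = "unlinked"
    CANDIDATE = "candidate"
    VERIFIED = "verified"
    OPERATOR_LINKED = "operator_linked"


def needs_operator_review(linkage: Linkage) -> bool:
    # Unlinked signals show review/link language, and candidate
    # signals require an operator link.
    return linkage in (Linkage.UNLINKED, Linkage.CANDIDATE)


def may_open_daw(linkage: Linkage, safe: bool) -> bool:
    # Candidate and unlinked signals never open a DAW. Verified or
    # operator-linked signals may, and only when it is safe to do so.
    return linkage in (Linkage.VERIFIED, Linkage.OPERATOR_LINKED) and safe
```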

Maintenance Focus

Prompts or signals involving:
repair issue
guest-reported problem
media evidence
arrival timing
property impact
Expected:
clear routing into maintenance/action layer

Pass Criteria

A response passes if it is:
truthful
specific
calm
useful
correctly scoped

Fail Criteria

A response fails if it:
overclaims
buries the next move
sounds generic
creates more work
ignores urgency
breaks response contract

See Also