Status: Canon baseline
Checkpoint: hybrid intent assist layer and weather context utility validation snapshot
Branch: build/engine-v0.1.3.5-focus-ops-v0.2
Validated result: 47/47 strict pass
Primary eval file: scripts/luciaBrainQualityEvalBank.v0.4.json
Purpose
The Lucia Intent Eval Framework exists to protect Lucia’s ability to understand operator intent without drifting into brittle phrase matching or generic assistant behavior. The intent layer must be evaluated as a behavior system, not as copywriting. Good evals should answer:
- Did Lucia understand what the user meant?
- Did Lucia choose the correct route?
- Did Lucia stay bounded?
- Did Lucia avoid generic capability fallback?
- Did Lucia preserve calm and operator relief?
- Did Lucia avoid inventing facts or actions?
Core evaluation principle
Do not reward hard-coded phrase memorization. Reward correct intent interpretation. Lucia should not need every possible phrase manually added to runtime code. The eval suite should increasingly test:
- compressed operator language
- ambiguous follow-ups
- emotional shorthand
- safe social exchanges
- property-context utilities
- off-role boundaries
- clarification quality
- confidence and arbitration behavior
Current proof set
The checkpoint eval suite validates these families.
1. Operational intent
Examples:
- priority triage
- concierge readiness
- payment risk
- maintenance focus
- arrival readiness
- defer-safe work
- general focus
2. Mixed-lane interpretation
Examples:
- “What is an maintenance concierge item still open?”
- “Any maintenance or concierge issues still open with the pool pump alarm?”
3. Human utility
Examples:
- “Good morning”
- “How are you?”
- “Are you having a good day?”
- “Thanks Lucia”
- “Tell me a joke”
4. Distress / overwhelm
Example:
- “I’m overwhelmed”
5. Semantic follow-up acceptance
Examples:
- “Nice to see you” → “Let’s do this”
- “Nice to see you” → “All right, let’s begin”
- “Nice to see you” → “Start me there”
- “Nice to see you” → “Take me into it”
6. Valedictions / closings
Examples:
- “Good night!”
- “Sleep well”
- “See you tomorrow”
- “Nice to see you” → “Good night”
7. Weather-context utility
Examples:
- “What’s the weather tomorrow?”
- “Will it rain?”
- “How’s it looking outside?”
- “Do guests need umbrellas tomorrow?”
- “Any weather concern for arrivals?”
- “Should we keep dinner outside tomorrow?”
- “Is outside still okay for the ceremony?”
Expected behavior:
- ask whether the user means their current location or Villa Valentin / the managed property
- do not invent forecast data
- do not fall into the generic off-topic response
8. Hard off-role boundaries
Examples:
- “Tell me sports news”
- weather boundary sequences after off-topic turns
- payment-dispute joke boundary
Validation commands
Run strict eval:.env directly:
client_disabled results during CLI evaluation.
Debug output requirements
For semantic failures, inspect:
- client_disabled: model key/env path is not live
- api_error: API call failed
- parse_error: model responded but JSON parse failed
- ok: raw model output is available for sanitization and arbitration
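As a rough illustration of how the four statuses listed above could drive a debugging pass, here is a minimal sketch. The status names come from this document; the `NEXT_STEP` table and `triage` function are hypothetical and are not part of the actual eval harness.

```python
# Hypothetical triage helper. Only the four status names are taken from
# this document; the function name and the wording of each step are
# illustrative assumptions.
NEXT_STEP = {
    "client_disabled": "check that the model key / env path is live",
    "api_error": "investigate the failed API call before judging intent",
    "parse_error": "inspect the raw response; the JSON parse failed",
    "ok": "review raw model output, then sanitization and arbitration",
}


def triage(status: str) -> str:
    """Return a suggested next debugging step for a semantic-layer status."""
    return NEXT_STEP.get(status, f"unknown status: {status!r}")


if __name__ == "__main__":
    for status in ("client_disabled", "api_error", "parse_error", "ok"):
        print(f"{status}: {triage(status)}")
```

Separating the first three statuses from `ok` helps keep infrastructure failures from being misread as intent failures.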
Pass / fail philosophy
Pass
A response passes when the correct intent and posture are achieved, even if wording varies.
Example: Both are valid weather clarification language:
Fail
A response fails when it:
- routes to the wrong intent
- falls into generic capability copy for valid human language
- answers open-domain content as if Lucia were a general assistant
- invents data
- claims weather, execution, or completion without a tool/source
- loses the property/operation context
- weakens hard boundaries
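The pass/fail philosophy above can be sketched as a check on route and posture rather than on exact wording. This is a simplified sketch under stated assumptions: the `EvalCase` fields, the `weather_clarify` intent name, and the forbidden-term list are all hypothetical and do not reflect the real eval bank schema.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    # Hypothetical fields; the real eval bank schema may differ.
    expected_intent: str
    forbidden_terms: tuple = ()  # e.g. generic capability copy, invented data


def passes(case: EvalCase, routed_intent: str, summary: str) -> bool:
    """Pass when the route is right and no forbidden content appears.

    Wording is deliberately not compared against a single golden string.
    """
    if routed_intent != case.expected_intent:
        return False
    text = summary.lower()
    return not any(term.lower() in text for term in case.forbidden_terms)


# Illustrative weather clarification case (intent name is an assumption).
case = EvalCase(
    expected_intent="weather_clarify",
    forbidden_terms=("i can help with many things", "the forecast is"),
)

a = passes(case, "weather_clarify",
           "Do you mean where you are now, or Villa Valentin?")
b = passes(case, "weather_clarify",
           "Is that for your current location or the property?")
c = passes(case, "weather_clarify",
           "The forecast is sunny at Villa Valentin.")  # invented data
d = passes(case, "off_topic", "Do you mean Villa Valentin?")  # wrong route
```

Here `a` and `b` both pass despite different wording, while `c` fails for claiming forecast data and `d` fails for routing to the wrong intent.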
Current limitation in eval harness
The current evaluator does not support multiple independent required_summary_any_terms groups.
This matters for weather because the ideal requirement is:
- mention user/current location
- mention Villa Valentin / property context
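To make the missing capability concrete, the following sketch shows one way grouped "any terms" matching could work: every group must be satisfied, and each group is satisfied by any one of its terms. Only the `required_summary_any_terms` name comes from this document; the `matches_all_groups` function and the term lists are hypothetical, and the current evaluator does not behave this way.

```python
def matches_all_groups(summary: str, groups: list) -> bool:
    """True when every group has at least one term present in the summary.

    Each inner list acts like one independent required_summary_any_terms
    group: any term within a group is enough, but all groups are required.
    """
    text = summary.lower()
    return all(
        any(term.lower() in text for term in group)
        for group in groups
    )


# Illustrative term lists for the weather clarification requirement.
weather_groups = [
    ["your location", "current location", "where you are"],  # user location
    ["villa valentin", "the property", "managed property"],  # property context
]

both = matches_all_groups(
    "Do you mean your current location or Villa Valentin?", weather_groups
)
property_only = matches_all_groups(
    "Do you mean Villa Valentin?", weather_groups
)
```

With these inputs `both` is true and `property_only` is false, which is the distinction a single flat term list cannot express.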
Why this eval framework matters
Lucia’s defensibility does not come from one prompt or one model call. It comes from compounding evaluation over time:
- real operator language
- real guest/owner ambiguity
- real property context
- emotional pressure
- safety and truth boundaries
- workflow-specific routing
- correction of failures without broad regressions
Future eval expansion
Next high-value eval categories:
Compressed follow-ups
Examples:
- “that one”
- “start there”
- “yep, first one”
- “take the safer path”
Ambiguous launch language
Examples:
- “let’s move”
- “walk me in”
- “where do we go?”
- “bring me into the day”
Emotional shorthand
Examples:
- “ugh”
- “I’m cooked”
- “too much today”
- “don’t make me think”
Clarification quality
Examples:
- ambiguous weather/location
- ambiguous property reference
- ambiguous “there” after multiple branch candidates
- ambiguous “handle that” after off-topic interruption
Boundary resilience
Examples:
- sports/news after social turn
- recipe request after ops turn
- unrelated weather trivia vs property-weather utility
- jokes involving sensitive operational issues
Truth-state discipline
Examples:
- weather without integration
- vendor dispatch before confirmation
- “did you already handle it?”
- “is the guest confirmed?” when state is uncertain
Definition of done for intent changes
An intent-layer change is not complete until:
- Runtime route is correct.
- Eval coverage exists.
- Debug path can explain failures.
- Boundaries still hold.
- Live UI feels natural.
- No generic capability fallback appears in valid human/property contexts.
- No data is invented.

