Lucia Intent Eval Framework

Status: Canon baseline
Checkpoint: hybrid intent assist layer and weather context utility validation snapshot
Branch: build/engine-v0.1.3.5-focus-ops-v0.2
Validated result: 47/47 strict pass
Primary eval file: scripts/luciaBrainQualityEvalBank.v0.4.json

Purpose

The Lucia Intent Eval Framework exists to protect Lucia’s ability to understand operator intent without drifting into brittle phrase matching or generic assistant behavior. The intent layer must be evaluated as a behavior system, not as copywriting. Good evals should answer:

Did Lucia understand what the user meant?
Did Lucia choose the correct route?
Did Lucia stay bounded?
Did Lucia avoid generic capability fallback?
Did Lucia preserve calm and operator relief?
Did Lucia avoid inventing facts or actions?

Core evaluation principle

Do not reward hard-coded phrase memorization. Reward correct intent interpretation. Lucia should not need every possible phrase manually added to runtime code. The eval suite should increasingly test:

compressed operator language
ambiguous follow-ups
emotional shorthand
safe social exchanges
property-context utilities
off-role boundaries
clarification quality
confidence and arbitration behavior

Current proof set

The checkpoint eval suite validates these families.

1. Operational intent

Examples:

priority triage
concierge readiness
payment risk
maintenance focus
arrival readiness
defer-safe work
general focus

Purpose: Confirm Lucia still routes core operational prompts correctly after semantic assist was added.

2. Mixed-lane interpretation

Examples:

“What is an maintenance concierge item still open?”
“Any maintenance or concierge issues still open with the pool pump alarm?”

Purpose: Confirm Lucia can normalize awkward or mixed operational language without losing the right lane.

3. Human utility

Examples:

“Good morning”
“How are you?”
“Are you having a good day?”
“Thanks Lucia”
“Tell me a joke”

Purpose: Confirm Lucia does not treat safe human interaction as misuse.

4. Distress / overwhelm

Example:

“I’m overwhelmed”

Purpose: Confirm Lucia validates pressure, narrows the field, and provides a first move instead of dumping tasks.

5. Semantic follow-up acceptance

Examples:

“Nice to see you” → “Let’s do this”
“Nice to see you” → “All right, let’s begin”
“Nice to see you” → “Start me there”
“Nice to see you” → “Take me into it”

Purpose: Confirm Lucia can interpret compressed operator follow-up intent through the hybrid assist path. These prompts are valuable because they should not require deterministic alias expansion.

6. Valedictions / closings

Examples:

“Good night!”
“Sleep well”
“See you tomorrow”
“Nice to see you” → “Good night”

Purpose: Confirm safe social closings route as human utility instead of defer work or generic off-topic.

7. Weather-context utility

Examples:

“What’s the weather tomorrow?”
“Will it rain?”
“How’s it looking outside?”
“Do guests need umbrellas tomorrow?”
“Any weather concern for arrivals?”
“Should we keep dinner outside tomorrow?”
“Is outside still okay for the ceremony?”

Purpose: Confirm Lucia treats weather as a bounded property-context utility, not generic open-domain chat.

Correct behavior:

ask whether the user means current location or Villa Valentin / managed property
do not invent forecast data
do not fall into generic off-topic

8. Hard off-role boundaries

Examples:

“Tell me sports news”
weather boundary sequences after off-topic turns
payment-dispute joke boundary

Purpose: Confirm semantic assist does not turn Lucia into a generic assistant.

Validation commands

Run strict eval:

node scripts/luciaBrainQualityEval.js --strict

Run intent-assist debug:

node scripts/luciaIntentAssistDebug.js

The scripts now load .env directly:

import "dotenv/config";

This prevents false client_disabled results during CLI evaluation.

Debug output requirements

For semantic failures, inspect:

deterministic_classification
pending_followup_state
pending_followup_options
deterministic_strength
context_hint
assist_gateway_diagnostics
raw_assist_json
sanitized_assist_result
arbitration_decision
final_routed_intent

The most important field is:

assist_gateway_diagnostics.error_stage

Expected values:

ok
client_disabled
api_error
parse_error

Interpretation:

client_disabled: model key/env path is not live
api_error: API call failed
parse_error: model responded but JSON parse failed
ok: raw model output is available for sanitization and arbitration
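
The interpretation table above can be sketched as a small triage helper. This is an illustrative sketch only: explainErrorStage and the hint wording are hypothetical, and it assumes the debug record is a plain object that nests error_stage under assist_gateway_diagnostics, as the field path above suggests.

```javascript
// Hypothetical triage helper, not part of the Lucia scripts.
// Stage names come from the "Expected values" list; hint text is illustrative.
const ERROR_STAGE_HINTS = new Map([
  ["ok", "Raw model output is available; inspect sanitized_assist_result and arbitration_decision."],
  ["client_disabled", "Model key/env path is not live; check that .env is being loaded."],
  ["api_error", "The API call failed; rerun and inspect the gateway diagnostics."],
  ["parse_error", "The model responded but JSON parsing failed; inspect raw_assist_json."],
]);

// Assumed record shape: { assist_gateway_diagnostics: { error_stage: "..." } }
function explainErrorStage(debugRecord) {
  const stage = debugRecord?.assist_gateway_diagnostics?.error_stage;
  if (stage === undefined) {
    return { stage: "missing", known: false, hint: "No error_stage found in assist_gateway_diagnostics." };
  }
  if (!ERROR_STAGE_HINTS.has(stage)) {
    return { stage, known: false, hint: "Unexpected error_stage value; treat as a harness bug." };
  }
  return { stage, known: true, hint: ERROR_STAGE_HINTS.get(stage) };
}

const sample = { assist_gateway_diagnostics: { error_stage: "parse_error" } };
console.log(explainErrorStage(sample).hint);
```

Treating an unlisted stage as an explicit "unknown" result, rather than silently passing it through, keeps a new gateway failure mode from being mistaken for a semantic failure.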

Pass / fail philosophy

Pass

A response passes when it achieves the correct intent and posture, even if the wording varies. For example, both of the following are valid weather clarifications:

Do you mean your current location or Villa Valentin?

Are you asking about weather where you are now, or at Villa Valentin?

Fail

A response fails when it:

routes to the wrong intent
falls into generic capability copy for valid human language
answers open-domain content as if Lucia were a general assistant
invents data
claims weather, execution, or completion without a tool/source
loses the property/operation context
weakens hard boundaries

Current limitation in eval harness

The current evaluator does not support multiple independent required_summary_any_terms groups. This matters for weather because the ideal requirement is that the response does both of the following:

mention user/current location
mention Villa Valentin / property context

Current workaround (a single flat list, so a match on any one term satisfies the check):

"required_summary_any_terms": [
  "current location",
  "where you are now",
  "where you are",
  "your location",
  "villa valentin",
  "one of your properties",
  "your properties"
]

Future improvement: Support grouped expectations such as:

"required_summary_any_groups": [
  ["current location", "where you are now", "where you are", "your location"],
  ["villa valentin", "one of your properties", "your properties"]
]

This would make evals stricter and more accurate.
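
A minimal sketch of how the grouped check could behave, compared with the flat list. The function names matchesAnyTerm and matchesAllGroups are hypothetical, and case-insensitive substring matching is an assumption about how the evaluator compares terms. Only the term lists are taken from the examples above.

```javascript
// Assumption: terms are matched as case-insensitive substrings of the summary.
function matchesAnyTerm(summary, terms) {
  const text = summary.toLowerCase();
  return terms.some((term) => text.includes(term.toLowerCase()));
}

// Grouped check: every group must contribute at least one matching term.
function matchesAllGroups(summary, groups) {
  return groups.every((group) => matchesAnyTerm(summary, group));
}

const groups = [
  ["current location", "where you are now", "where you are", "your location"],
  ["villa valentin", "one of your properties", "your properties"],
];
// The current workaround is equivalent to flattening both groups into one list.
const flatTerms = groups.flat();

const full = "Do you mean your current location or Villa Valentin?";
const partial = "Do you mean Villa Valentin?";

console.log(matchesAnyTerm(partial, flatTerms)); // true: flat list accepts a one-sided answer
console.log(matchesAllGroups(partial, groups)); // false: user-location group is unmatched
console.log(matchesAllGroups(full, groups)); // true: both groups are matched
```

The one-sided clarification passes the flat list but fails the grouped check, which is the gap that grouped expectations would close.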

Why this eval framework matters

Lucia’s defensibility does not come from one prompt or one model call. It comes from compounding evaluation over time:

real operator language
real guest/owner ambiguity
real property context
emotional pressure
safety and truth boundaries
workflow-specific routing
correction of failures without broad regressions

Every eval that captures a high-signal behavior becomes part of Lucia’s moat.

Future eval expansion

Next high-value eval categories:

Compressed follow-ups

Examples:

“that one”
“start there”
“yep, first one”
“take the safer path”

Ambiguous launch language

Examples:

“let’s move”
“walk me in”
“where do we go?”
“bring me into the day”

Emotional shorthand

Examples:

“ugh”
“I’m cooked”
“too much today”
“don’t make me think”

Clarification quality

Examples:

ambiguous weather/location
ambiguous property reference
ambiguous “there” after multiple branch candidates
ambiguous “handle that” after off-topic interruption

Boundary resilience

Examples:

sports/news after social turn
recipe request after ops turn
unrelated weather trivia vs property-weather utility
jokes involving sensitive operational issues

Truth-state discipline

Examples:

weather without integration
vendor dispatch before confirmation
“did you already handle it?”
“is the guest confirmed?” when state is uncertain

Definition of done for intent changes

An intent-layer change is not complete until:

Runtime route is correct.
Eval coverage exists.
Debug path can explain failures.
Boundaries still hold.
Live UI feels natural.
No generic capability fallback appears in valid human/property contexts.
No data is invented.

Canon summary

The Lucia Intent Eval Framework is the quality system that turns the intent layer from a clever patch into compounding intelligence infrastructure. It is how Lucia learns safely without becoming loose.
