Status: Canon baseline
Checkpoint: hybrid intent assist layer and weather context utility validation snapshot
Branch: build/engine-v0.1.3.5-focus-ops-v0.2
Validated result: 47/47 strict pass
Primary eval file: scripts/luciaBrainQualityEvalBank.v0.4.json

Purpose

The Lucia Intent Eval Framework exists to protect Lucia’s ability to understand operator intent without drifting into brittle phrase matching or generic assistant behavior. The intent layer must be evaluated as a behavior system, not as copywriting. Good evals should answer:
  • Did Lucia understand what the user meant?
  • Did Lucia choose the correct route?
  • Did Lucia stay bounded?
  • Did Lucia avoid generic capability fallback?
  • Did Lucia preserve calm and operator relief?
  • Did Lucia avoid inventing facts or actions?

Core evaluation principle

Do not reward hard-coded phrase memorization. Reward correct intent interpretation. Lucia should not need every possible phrase manually added to runtime code. The eval suite should increasingly test:
  • compressed operator language
  • ambiguous follow-ups
  • emotional shorthand
  • safe social exchanges
  • property-context utilities
  • off-role boundaries
  • clarification quality
  • confidence and arbitration behavior

Current proof set

The checkpoint eval suite validates these families.

1. Operational intent

Examples:
  • priority triage
  • concierge readiness
  • payment risk
  • maintenance focus
  • arrival readiness
  • defer-safe work
  • general focus
Purpose: Confirm Lucia still routes core operational prompts correctly after semantic assist was added.

2. Mixed-lane interpretation

Examples:
  • “What is an maintenance concierge item still open?”
  • “Any maintenance or concierge issues still open with the pool pump alarm?”
Purpose: Confirm Lucia can normalize awkward or mixed operational language without losing the right lane.

3. Human utility

Examples:
  • “Good morning”
  • “How are you?”
  • “Are you having a good day?”
  • “Thanks Lucia”
  • “Tell me a joke”
Purpose: Confirm Lucia does not treat safe human interaction as misuse.

4. Distress / overwhelm

Example:
  • “I’m overwhelmed”
Purpose: Confirm Lucia validates pressure, narrows the field, and provides a first move instead of dumping tasks.

5. Semantic follow-up acceptance

Examples:
  • “Nice to see you” → “Let’s do this”
  • “Nice to see you” → “All right, let’s begin”
  • “Nice to see you” → “Start me there”
  • “Nice to see you” → “Take me into it”
Purpose: Confirm Lucia can interpret compressed operator follow-up intent through the hybrid assist path. These prompts are valuable because they should not require deterministic alias expansion.

6. Valedictions / closings

Examples:
  • “Good night!”
  • “Sleep well”
  • “See you tomorrow”
  • “Nice to see you” → “Good night”
Purpose: Confirm safe social closings route as human utility instead of defer work or generic off-topic.

7. Weather-context utility

Examples:
  • “What’s the weather tomorrow?”
  • “Will it rain?”
  • “How’s it looking outside?”
  • “Do guests need umbrellas tomorrow?”
  • “Any weather concern for arrivals?”
  • “Should we keep dinner outside tomorrow?”
  • “Is outside still okay for the ceremony?”
Purpose: Confirm Lucia treats weather as a bounded property-context utility, not generic open-domain chat.
Correct behavior:
  • ask whether the user means current location or Villa Valentin / managed property
  • do not invent forecast data
  • do not fall into generic off-topic

8. Hard off-role boundaries

Examples:
  • “Tell me sports news”
  • weather boundary sequences after off-topic turns
  • payment-dispute joke boundary
Purpose: Confirm semantic assist does not turn Lucia into a generic assistant.

Validation commands

Run strict eval:
node scripts/luciaBrainQualityEval.js --strict
Run intent-assist debug:
node scripts/luciaIntentAssistDebug.js
The scripts now load .env directly:
import "dotenv/config";
This prevents false client_disabled results during CLI evaluation.

Debug output requirements

For semantic failures, inspect:
deterministic_classification
pending_followup_state
pending_followup_options
deterministic_strength
context_hint
assist_gateway_diagnostics
raw_assist_json
sanitized_assist_result
arbitration_decision
final_routed_intent
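A quick way to confirm a debug record is complete enough to inspect is to check it against the field list above. The following is a minimal sketch, not the actual debug script: the helper name missingDebugFields and the assumption that each debug record is a flat object keyed by these field names are illustrative only.

```javascript
// Field names come from the list above. The flat record shape is an assumption.
const REQUIRED_DEBUG_FIELDS = [
  "deterministic_classification",
  "pending_followup_state",
  "pending_followup_options",
  "deterministic_strength",
  "context_hint",
  "assist_gateway_diagnostics",
  "raw_assist_json",
  "sanitized_assist_result",
  "arbitration_decision",
  "final_routed_intent",
];

// Hypothetical helper: returns the required field names that are absent from
// a debug record. A field explicitly set to null still counts as present,
// because "nothing pending" is a state worth seeing in the output.
function missingDebugFields(record) {
  return REQUIRED_DEBUG_FIELDS.filter(
    (field) => !Object.prototype.hasOwnProperty.call(record, field)
  );
}
```

An empty result means every field is available for inspection. Any names returned point to a gap in the debug path rather than in Lucia's routing.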
The most important field is:
assist_gateway_diagnostics.error_stage
Expected values:
ok
client_disabled
api_error
parse_error
Interpretation:
  • client_disabled: model key/env path is not live
  • api_error: API call failed
  • parse_error: model responded but JSON parse failed
  • ok: raw model output is available for sanitization and arbitration
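The interpretation table above can be kept next to the debug output as a small lookup. This is a sketch under assumptions: the function name describeErrorStage and its return shape are hypothetical, and only the four error_stage values and their meanings come from this document.

```javascript
// Meanings copied from the interpretation list above.
const ERROR_STAGE_MEANING = {
  ok: "raw model output is available for sanitization and arbitration",
  client_disabled: "model key/env path is not live",
  api_error: "API call failed",
  parse_error: "model responded but JSON parse failed",
};

// Hypothetical helper: takes the assist_gateway_diagnostics object and
// returns the stage, whether it is one of the expected values, and its meaning.
function describeErrorStage(diagnostics) {
  const stage = diagnostics ? diagnostics.error_stage : undefined;
  // hasOwnProperty avoids treating inherited keys like "toString" as known stages.
  if (Object.prototype.hasOwnProperty.call(ERROR_STAGE_MEANING, stage)) {
    return { stage, known: true, meaning: ERROR_STAGE_MEANING[stage] };
  }
  return { stage, known: false, meaning: "unexpected error_stage value" };
}
```

Reporting unknown stages explicitly, instead of falling through silently, keeps a new or misspelled error_stage from being read as a pass.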

Pass / fail philosophy

Pass

A response passes when it achieves the correct intent and posture, even if wording varies. For example, both of these are valid weather clarifications:
Do you mean your current location or Villa Valentin?
Are you asking about weather where you are now, or at Villa Valentin?
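Wording-tolerant passing can be illustrated with a simple any-term check. This is a sketch only: the function name matchesAnyTerm is hypothetical, and case-insensitive substring matching is an assumption about how the evaluator compares terms. The term list is the one used for weather in the eval bank.

```javascript
// Term list as used for the weather evals in this document.
const WEATHER_CLARIFICATION_TERMS = [
  "current location",
  "where you are now",
  "where you are",
  "your location",
  "villa valentin",
  "one of your properties",
  "your properties",
];

// Hypothetical matcher: passes if the response contains at least one term.
// Case-insensitive substring matching is assumed, not confirmed.
function matchesAnyTerm(summary, terms) {
  const text = summary.toLowerCase();
  return terms.some((term) => text.includes(term.toLowerCase()));
}

const wordingA = "Do you mean your current location or Villa Valentin?";
const wordingB = "Are you asking about weather where you are now, or at Villa Valentin?";
console.log(matchesAnyTerm(wordingA, WEATHER_CLARIFICATION_TERMS)); // true
console.log(matchesAnyTerm(wordingB, WEATHER_CLARIFICATION_TERMS)); // true
```

Both phrasings pass because the check looks for intent-bearing terms rather than one exact sentence.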

Fail

A response fails when it:
  • routes to the wrong intent
  • falls into generic capability copy for valid human language
  • answers open-domain content as if Lucia were a general assistant
  • invents data
  • claims weather, execution, or completion without a tool/source
  • loses the property/operation context
  • weakens hard boundaries

Current limitation in eval harness

The current evaluator does not support multiple independent required_summary_any_terms groups. This matters for weather because the ideal requirement is:
  1. mention user/current location
  2. mention Villa Valentin / property context
Current workaround (one flat list, so a response that mentions only one of the two still passes):
"required_summary_any_terms": [
  "current location",
  "where you are now",
  "where you are",
  "your location",
  "villa valentin",
  "one of your properties",
  "your properties"
]
Future improvement: Support grouped expectations such as:
"required_summary_any_groups": [
  ["current location", "where you are now", "where you are", "your location"],
  ["villa valentin", "one of your properties", "your properties"]
]
This would make evals stricter and more accurate.
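A grouped check could look roughly like the following. This is a sketch of the proposed behavior, not the evaluator's code: the function name passesGroupedTerms is hypothetical, and case-insensitive substring matching is assumed.

```javascript
// Hypothetical matcher for required_summary_any_groups:
// every group must be satisfied by at least one of its terms.
function passesGroupedTerms(summary, groups) {
  const text = summary.toLowerCase();
  return groups.every((group) =>
    group.some((term) => text.includes(term.toLowerCase()))
  );
}

const locationTerms = ["current location", "where you are now", "where you are", "your location"];
const propertyTerms = ["villa valentin", "one of your properties", "your properties"];

const grouped = [locationTerms, propertyTerms];
// The current flat workaround behaves like a single group holding every term.
const flat = [[...locationTerms, ...propertyTerms]];

const complete = "Do you mean your current location or Villa Valentin?";
const oneSided = "Do you mean Villa Valentin?";

console.log(passesGroupedTerms(complete, grouped)); // true
console.log(passesGroupedTerms(oneSided, flat));    // true: the flat list accepts it
console.log(passesGroupedTerms(oneSided, grouped)); // false: location side is missing
```

The one-sided clarification is the case the flat list cannot catch, and it is the one the grouped form would fail.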

Why this eval framework matters

Lucia’s defensibility does not come from one prompt or one model call. It comes from compounding evaluation over time:
  • real operator language
  • real guest/owner ambiguity
  • real property context
  • emotional pressure
  • safety and truth boundaries
  • workflow-specific routing
  • correction of failures without broad regressions
Every eval that captures a high-signal behavior becomes part of Lucia’s moat.

Future eval expansion

Next high-value eval categories:

Compressed follow-ups

Examples:
  • “that one”
  • “start there”
  • “yep, first one”
  • “take the safer path”

Ambiguous launch language

Examples:
  • “let’s move”
  • “walk me in”
  • “where do we go?”
  • “bring me into the day”

Emotional shorthand

Examples:
  • “ugh”
  • “I’m cooked”
  • “too much today”
  • “don’t make me think”

Clarification quality

Examples:
  • ambiguous weather/location
  • ambiguous property reference
  • ambiguous “there” after multiple branch candidates
  • ambiguous “handle that” after off-topic interruption

Boundary resilience

Examples:
  • sports/news after social turn
  • recipe request after ops turn
  • unrelated weather trivia vs property-weather utility
  • jokes involving sensitive operational issues

Truth-state discipline

Examples:
  • weather without integration
  • vendor dispatch before confirmation
  • “did you already handle it?”
  • “is the guest confirmed?” when state is uncertain

Definition of done for intent changes

An intent-layer change is not complete until:
  1. Runtime route is correct.
  2. Eval coverage exists.
  3. Debug path can explain failures.
  4. Boundaries still hold.
  5. Live UI feels natural.
  6. No generic capability fallback appears in valid human/property contexts.
  7. No data is invented.

Canon summary

The Lucia Intent Eval Framework is the quality system that turns the intent layer from a clever patch into compounding intelligence infrastructure. It is how Lucia learns safely without becoming loose.