How Eval Labs Improves Lucia

Eval Labs improves Lucia by turning subjective impressions into repeated, inspectable behavioral evidence and by proving the evaluation platform can support Lucia intelligence work at scale.

The improvement mechanism

Eval Labs helps Lucia improve by creating this loop:

Behavior observed
Pattern identified
Suggested review and human judgment compared
Run History / Analysis evidence inspected
Owner file inspected
Smallest correct patch made
Dev deployed
Same suite re-run
Human review confirms or rejects improvement

What Eval Labs catches

Eval Labs can catch:

wrong intent routing
tone drift
weak containment
generic language
overclaiming
missing next moves
payment-risk prioritization errors
arrival-readiness misses
concierge readiness gaps
multilingual regressions
model upgrade regressions

It also protects the evaluation platform itself by validating run creation, persistence, finalization, Run History, Analysis, and compact client state before employees depend on the system.

Why repeated suites matter

If every run uses new prompts, there is no baseline, so we cannot tell whether Lucia improved. Custom suites re-run the same cases, which lets us compare before/after behavior directly. That turns product feel into product evidence.
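A before/after comparison over one suite can be sketched as below. This is only an illustration: the result shape (`caseId`, `passed`), the function name `compareSuiteRuns`, and the sample case ids are assumptions, not Eval Labs' actual data model. Here `passed` stands in for whatever verdict the review step records.

```javascript
// Minimal sketch: compare two runs of the same suite, case by case.
// Hypothetical result shape: { caseId: string, passed: boolean }.
function compareSuiteRuns(beforeRun, afterRun) {
  const beforeById = new Map(beforeRun.map((r) => [r.caseId, r.passed]));
  const summary = { improved: [], regressed: [], unchanged: [], unmatched: [] };

  for (const { caseId, passed } of afterRun) {
    if (!beforeById.has(caseId)) {
      // A case with no baseline cannot show improvement or regression.
      summary.unmatched.push(caseId);
      continue;
    }
    const wasPassing = beforeById.get(caseId);
    if (!wasPassing && passed) summary.improved.push(caseId);
    else if (wasPassing && !passed) summary.regressed.push(caseId);
    else summary.unchanged.push(caseId);
  }
  return summary;
}

// Illustrative data only; these case ids are made up.
const beforeRun = [
  { caseId: "payment-risk-01", passed: false },
  { caseId: "tone-02", passed: true },
];
const afterRun = [
  { caseId: "payment-risk-01", passed: true },
  { caseId: "tone-02", passed: true },
  { caseId: "arrival-03", passed: true },
];
const runSummary = compareSuiteRuns(beforeRun, afterRun);
```

The `unmatched` bucket makes the point of the paragraph concrete: a case that appears only in the later run has nothing to be compared against, so it cannot count as evidence of improvement.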

Lucia source-of-truth behavior

Eval Labs does not patch Lucia. Eval Labs reveals where Lucia needs patching. For behavior issues, the Engine owner file is usually one of:

operatorFocusBrain.js
refineOperatorFocusOutput.js
luciaModelConfig.js
luciaModelGateway.js

Wrong mode usually starts in operatorFocusBrain.js. Right mode but awkward language may belong in refineOperatorFocusOutput.js.
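The two routing rules above could be written down as a small triage helper. This is a sketch under stated assumptions: the finding shape (`modeCorrect`, `languageOk`) and the function name `likelyOwnerFile` are hypothetical, and only the two mappings given in the text are encoded. No rule is stated for luciaModelConfig.js or luciaModelGateway.js, so the helper does not guess at them.

```javascript
// Minimal sketch: suggest a first file to inspect for a behavior finding.
// Hypothetical finding shape: { modeCorrect: boolean, languageOk: boolean }.
function likelyOwnerFile(finding) {
  // Wrong mode usually starts in the brain.
  if (!finding.modeCorrect) return "operatorFocusBrain.js";
  // Right mode but awkward language may belong in the refiner.
  if (!finding.languageOk) return "refineOperatorFocusOutput.js";
  // No rule is stated for other cases; leave them to human triage.
  return null;
}

const wrongModeOwner = likelyOwnerFile({ modeCorrect: false, languageOk: true });
const awkwardLanguageOwner = likelyOwnerFile({ modeCorrect: true, languageOk: false });
```

The result is a starting point for inspection, not a verdict. The text says "usually" and "may", so the owner file still has to be confirmed by reading it.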

The real-world milestone

Eval Labs is now officially being used for Lucia refinement against the dev Engine. That means it is no longer just documentation or future infrastructure. It is part of the live development loop. As of the 60-run AI-reviewed gate, Eval Labs is also proven as platform infrastructure for readiness checks. That is a product-infrastructure milestone, not human approval of Lucia quality.

Updated improvement mechanism — from employee signal to canon signal

The improvement loop now separates signal quality:

Lucia behavior observed
app suggestions provide initial signal
employee quick review captures reaction
Human Guidance Evaluation captures structured judgment
reviewer-owned final judgment is saved
senior reviewer adjudicates important cases
exported lifecycle evidence preserves the trail
reusable learning becomes canon candidate
engineering patches smallest correct layer
same suite is re-run
evidence confirms or rejects improvement

This prevents non-expert review from directly becoming Lucia doctrine while still letting the whole team contribute useful signal. The app may suggest, but the reviewer must decide. Reviewed exports preserve the signal chain: suggested review, employee review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and dirty / completion state.
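One way to check that a reviewed export still carries the full signal chain is sketched below. The field names are assumptions derived from the list in the paragraph above, not the real export schema. Splitting "dirty / completion state" into two fields is also an assumption.

```javascript
// Minimal sketch: report which signal-chain fields an export record lacks.
// All field names are hypothetical; the real export schema may differ.
const SIGNAL_CHAIN_FIELDS = [
  "suggestedReview",
  "employeeReview",
  "humanGuidanceEvaluation",
  "adjudicationMetadata",
  "lifecycleState",
  "testerIdentity",
  "dirtyState",
  "completionState",
];

function missingSignalChainFields(exportRecord) {
  // A field counts as missing when it is absent or explicitly null.
  return SIGNAL_CHAIN_FIELDS.filter(
    (field) => exportRecord[field] === undefined || exportRecord[field] === null
  );
}

// Illustrative record: only the app suggestion exists, no reviewer judgment yet.
const partialExport = { suggestedReview: "flag tone drift", lifecycleState: "suggested" };
const missingFields = missingSignalChainFields(partialExport);
```

A record that is missing `humanGuidanceEvaluation` or `adjudicationMetadata` has an app suggestion but no reviewer decision. That is the state the paragraph says must not become Lucia doctrine directly.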

Intelligence stack role

Eval Labs is part of Lucia’s intelligence stack. It helps harden:

truthfulness
emotional containment
operational usefulness
intent routing
trust-state discipline
evaluator feedback loops
platform evidence recovery for future threads

The Canon should therefore treat Eval Labs as product infrastructure, not as a side tool.
