Eval Labs improves Lucia by turning subjective impressions into repeated, inspectable behavioral evidence and by proving the evaluation platform can support Lucia intelligence work at scale.
The improvement mechanism
Eval Labs helps Lucia improve by creating this loop:
- Behavior observed
- Pattern identified
- Suggested review and human judgment compared
- Run History / Analysis evidence inspected
- Owner file inspected
- Smallest correct patch made
- Dev deployed
- Same suite re-run
- Human review confirms or rejects improvement
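The "same suite re-run" step can be sketched as a per-case comparison between two runs of one suite. This is an illustration only: `CaseResult`, `compare_runs`, and the case IDs are hypothetical names, not part of Eval Labs, and a boolean pass/fail stands in for whatever outcome human review actually records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CaseResult:
    case_id: str  # stable ID of one suite case (hypothetical field)
    passed: bool  # simplified stand-in for the reviewed outcome of that case


def compare_runs(before: List[CaseResult], after: List[CaseResult]) -> Dict[str, List[str]]:
    """Group case IDs by how their outcome changed between two runs of one suite."""
    before_by_id = {r.case_id: r.passed for r in before}
    after_by_id = {r.case_id: r.passed for r in after}
    if before_by_id.keys() != after_by_id.keys():
        # Comparing different case sets would not show whether behavior improved.
        raise ValueError("both runs must cover the same suite cases")

    diff: Dict[str, List[str]] = {
        "fixed": [], "regressed": [], "still_failing": [], "still_passing": [],
    }
    for case_id in sorted(before_by_id):
        was_passing, now_passing = before_by_id[case_id], after_by_id[case_id]
        if now_passing and not was_passing:
            diff["fixed"].append(case_id)
        elif was_passing and not now_passing:
            diff["regressed"].append(case_id)
        elif now_passing:
            diff["still_passing"].append(case_id)
        else:
            diff["still_failing"].append(case_id)
    return diff


# Hypothetical runs of the same three-case suite, before and after a patch.
before_patch = [CaseResult("tone-01", False), CaseResult("route-02", True), CaseResult("pay-03", False)]
after_patch = [CaseResult("tone-01", True), CaseResult("route-02", False), CaseResult("pay-03", False)]
print(compare_runs(before_patch, after_patch))
```

In this sketch, a non-empty `regressed` list would be the cue for human review to look closer before accepting a patch, even when the targeted case is fixed.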
What Eval Labs catches
Eval Labs can catch:
- wrong intent routing
- tone drift
- weak containment
- generic language
- overclaiming
- missing next moves
- payment-risk prioritization errors
- arrival-readiness misses
- concierge readiness gaps
- multilingual regressions
- model upgrade regressions
Why repeated suites matter
If we only test new prompts every time, we cannot tell whether Lucia improved. Custom suites let us compare before/after behavior. That turns product feel into product evidence.
Lucia source-of-truth behavior
Eval Labs does not patch Lucia; it reveals where Lucia needs patching. For behavior issues, the likely Engine owner is usually operatorFocusBrain.js.
A response in the right mode but with awkward language may instead belong in refineOperatorFocusOutput.js.
The real-world milestone
Eval Labs is now officially being used for Lucia refinement against the dev Engine. That means it is no longer just documentation or future infrastructure. It is part of the live development loop. As of the 60-run AI-reviewed gate, Eval Labs is also proven as platform infrastructure for readiness checks. That is a product-infrastructure milestone, not human approval of Lucia quality.
Updated improvement mechanism: from employee signal to canon signal
The improvement loop now separates signal quality:
- Lucia behavior observed
- app suggestions provide initial signal
- employee quick review captures reaction
- Human Guidance Evaluation captures structured judgment
- reviewer-owned final judgment is saved
- senior reviewer adjudicates important cases
- exported lifecycle evidence preserves the trail
- reusable learning becomes canon candidate
- engineering patches smallest correct layer
- same suite is re-run
- evidence confirms or rejects improvement
This prevents non-expert review from directly becoming Lucia doctrine while still letting the whole team contribute useful signal.
The app may suggest, but the reviewer must decide.
Reviewed exports preserve the signal chain: suggested review, employee review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and dirty / completion state.
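One way to picture the signal chain is as a single record per reviewed case. The sketch below is a hypothetical shape, not the actual export format: the class, field names, lifecycle values, and the `is_canon_candidate` gate are all assumptions made for illustration. In particular, treating "dirty" as "has unsaved edits" and requiring adjudication for every canon candidate are guesses, since the text above only says that senior reviewers adjudicate important cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewedExport:
    """Hypothetical shape of one reviewed export record; names are illustrative."""
    suggested_review: str                     # app suggestion (initial signal only)
    employee_review: Optional[str]            # quick employee reaction
    human_guidance_evaluation: Optional[str]  # structured human judgment
    final_judgment: Optional[str]             # reviewer-owned decision, if saved
    adjudicated_by: Optional[str]             # adjudication metadata (assumed: reviewer ID)
    lifecycle_state: str                      # values here are assumed, not documented
    tester_identity: str
    dirty: bool                               # assumed meaning: unsaved edits present
    complete: bool


def decided_judgment(record: ReviewedExport) -> Optional[str]:
    """Return the reviewer's decision; never fall back to the app suggestion."""
    return record.final_judgment


def is_canon_candidate(record: ReviewedExport) -> bool:
    """Assumed gate: saved final judgment, adjudicated, complete, and not dirty."""
    return (
        record.final_judgment is not None
        and record.adjudicated_by is not None
        and record.complete
        and not record.dirty
    )


# Hypothetical record: the app suggested "pass", the reviewer decided "fail".
example = ReviewedExport(
    suggested_review="pass",
    employee_review="felt generic",
    human_guidance_evaluation="generic language, missing next move",
    final_judgment="fail",
    adjudicated_by="senior-reviewer-1",
    lifecycle_state="adjudicated",
    tester_identity="tester-1",
    dirty=False,
    complete=True,
)
print(decided_judgment(example), is_canon_candidate(example))
```

Keeping `suggested_review` and `final_judgment` as separate fields mirrors the rule that the app may suggest but the reviewer decides: the suggestion is preserved as evidence and never substituted for a missing judgment.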
Intelligence stack role
Eval Labs is part of Lucia’s intelligence stack. It helps harden:
- truthfulness
- emotional containment
- operational usefulness
- intent routing
- trust-state discipline
- evaluator feedback loops
- platform evidence recovery for future threads

