Eval Labs improves Lucia by turning subjective impressions into repeated, inspectable behavioral evidence and by proving the evaluation platform can support Lucia intelligence work at scale.

The improvement mechanism

Eval Labs helps Lucia improve by creating this loop:
  • Behavior observed
  • Pattern identified
  • Suggested review and human judgment compared
  • Run History / Analysis evidence inspected
  • Owner file inspected
  • Smallest correct patch made
  • Dev deployed
  • Same suite re-run
  • Human review confirms or rejects improvement

What Eval Labs catches

Eval Labs can catch:
  • wrong intent routing
  • tone drift
  • weak containment
  • generic language
  • overclaiming
  • missing next moves
  • payment-risk prioritization errors
  • arrival-readiness misses
  • concierge readiness gaps
  • multilingual regressions
  • model upgrade regressions
It also protects the evaluation platform itself by validating run creation, persistence, finalization, Run History, Analysis, and compact client state before employees depend on the system.

Why repeated suites matter

If we only test new prompts every time, we cannot tell whether Lucia improved. Custom suites let us compare before/after behavior. That turns product feel into product evidence.
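A before/after comparison over the same suite can be sketched as follows. This is a minimal illustration, not Eval Labs code: the function name `compareSuiteRuns`, the case ids, and the assumption that each run reduces to a pass/fail verdict per case are all hypothetical.

```javascript
// Hypothetical sketch: each run is an object mapping a case id to a
// pass/fail verdict. These names and shapes are assumptions for
// illustration, not the Eval Labs data model.
const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

function compareSuiteRuns(before, after) {
  const report = { fixed: [], regressed: [], unchanged: [], unmatched: [] };
  const caseIds = new Set([...Object.keys(before), ...Object.keys(after)]);

  for (const caseId of caseIds) {
    if (!hasOwn(before, caseId) || !hasOwn(after, caseId)) {
      // A case present in only one run cannot show improvement either way.
      report.unmatched.push(caseId);
    } else if (!before[caseId] && after[caseId]) {
      report.fixed.push(caseId);
    } else if (before[caseId] && !after[caseId]) {
      report.regressed.push(caseId);
    } else {
      report.unchanged.push(caseId);
    }
  }
  return report;
}

// Same suite, run before and after a patch (illustrative data only).
const beforePatch = { "tone-01": false, "routing-02": true, "payment-03": true };
const afterPatch = { "tone-01": true, "routing-02": true, "payment-03": false };
const suiteReport = compareSuiteRuns(beforePatch, afterPatch);
console.log(suiteReport);
```

In this example `tone-01` lands in `fixed` and `payment-03` in `regressed`. The `unmatched` bucket is the point of repeating a suite: a case that appears in only one run says nothing about whether Lucia improved.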

Lucia source-of-truth behavior

Eval Labs does not patch Lucia. Eval Labs reveals where Lucia needs patching. For behavior issues, the Engine owner is usually one of:
operatorFocusBrain.js
refineOperatorFocusOutput.js
luciaModelConfig.js
luciaModelGateway.js
Wrong mode usually starts in operatorFocusBrain.js. Right mode but awkward language may belong in refineOperatorFocusOutput.js.
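The two routing rules above can be written down as a small triage helper. This is a hedged sketch: `suggestFirstOwnerFile` and the `modeCorrect` / `languageAwkward` flags are hypothetical names, and only the two rules stated in the text are encoded. No rule is assumed for luciaModelConfig.js or luciaModelGateway.js.

```javascript
// Hypothetical triage helper. It suggests where to look first; it does not
// decide ownership. The finding shape is an assumption for illustration.
function suggestFirstOwnerFile(finding) {
  if (!finding.modeCorrect) {
    // Wrong mode usually starts in the focus brain.
    return "operatorFocusBrain.js";
  }
  if (finding.languageAwkward) {
    // Right mode but awkward language may belong in the refinement layer.
    return "refineOperatorFocusOutput.js";
  }
  // No stated rule applies; inspect the evidence manually.
  return null;
}

console.log(suggestFirstOwnerFile({ modeCorrect: false, languageAwkward: true }));
console.log(suggestFirstOwnerFile({ modeCorrect: true, languageAwkward: true }));
```

The mode check runs first because awkward language under the wrong mode is usually a symptom of the routing error rather than a separate wording problem.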

The real-world milestone

Eval Labs is now officially being used for Lucia refinement against the dev Engine. That means it is no longer just documentation or future infrastructure. It is part of the live development loop. As of the 60-run AI-reviewed gate, Eval Labs is also proven as platform infrastructure for readiness checks. That is a product-infrastructure milestone, not human approval of Lucia quality.

Updated improvement mechanism — from employee signal to canon signal

The improvement loop now separates signal quality:
  • Lucia behavior observed
  • app suggestions provide initial signal
  • employee quick review captures reaction
  • Human Guidance Evaluation captures structured judgment
  • reviewer-owned final judgment is saved
  • senior reviewer adjudicates important cases
  • exported lifecycle evidence preserves the trail
  • reusable learning becomes canon candidate
  • engineering patches smallest correct layer
  • same suite is re-run
  • evidence confirms or rejects improvement
This prevents non-expert review from directly becoming Lucia doctrine while still letting the whole team contribute useful signal. The app may suggest, but the reviewer must decide. Reviewed exports preserve the signal chain: suggested review, employee review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and dirty / completion state.
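One way to check that a reviewed export still carries the full signal chain is sketched below. The field names are hypothetical stand-ins for the items listed above, and splitting "dirty / completion state" into two keys is an assumption; the real export schema may differ.

```javascript
// Hypothetical field names mirroring the signal chain described above.
// They are assumptions for illustration, not the real export schema.
const REQUIRED_SIGNAL_FIELDS = [
  "suggestedReview",
  "employeeReview",
  "humanGuidanceEvaluation",
  "adjudication",
  "lifecycleState",
  "testerIdentity",
  "dirtyState",
  "completionState",
];

// Returns the names of signal-chain fields absent from an export record.
// A key may hold null (for example, a case not yet adjudicated), but the
// key itself must exist so the gap is explicit rather than silent.
function missingSignalFields(exportRecord) {
  return REQUIRED_SIGNAL_FIELDS.filter(
    (field) => !Object.prototype.hasOwnProperty.call(exportRecord, field)
  );
}

const exampleExport = {
  suggestedReview: "pass",
  employeeReview: "needs work",
  humanGuidanceEvaluation: { verdict: "fail" },
  lifecycleState: "reviewed",
  testerIdentity: "tester-a",
  dirtyState: false,
  completionState: "complete",
};

console.log(missingSignalFields(exampleExport)); // [ 'adjudication' ]
```

Keeping the suggested review and the reviewer's judgment as separate fields reflects the rule that the app may suggest but the reviewer must decide.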

Intelligence stack role

Eval Labs is part of Lucia’s intelligence stack. It helps harden:
  • truthfulness
  • emotional containment
  • operational usefulness
  • intent routing
  • trust-state discipline
  • evaluator feedback loops
  • platform evidence recovery for future threads
The Canon should therefore treat Eval Labs as product infrastructure, not as a side tool.