Skip to main content
Eval Labs scores Lucia across dimensions that matter to both operational quality and emotional containment.

Current dimensions

Eval Labs currently captures these rating dimensions:
tone
usefulness
calming
naturalness
trust
It also captures:
keepTalking
suggestedKeepTalking
status
suggestedStatus
priority
suggestedPriority
feltOff
owner

Tone

Score whether the language fits the moment. Strong tone is:
  • warm
  • clear
  • direct
  • composed
  • human
Weak tone is:
  • cold
  • robotic
  • mushy
  • fake cheerful
  • corporate sludge

Usefulness

Score whether the response helped the user act or understand. A useful response reduces work. An unhelpful response creates new work.

Calming

Score whether the response reduces pressure. Calming does not mean soft. Calming means the user feels more oriented after reading it.

Naturalness

Score whether the response sounds like a real trusted operator would speak. Natural does not mean casual fluff. Natural means the phrasing feels human and appropriate.

Trust

Score whether the response increases or preserves confidence in Lucia. Trust is damaged by:
  • overclaiming
  • vague certainty
  • missing obvious context
  • wrong tone
  • false reassurance
  • capability menus in emotional moments

Keep talking

This answers:
Would a user keep talking to Lucia after this response?
Use this honestly. If a response makes Lucia feel like a wall, mark it down.

Felt off

Use this field for specific notes. Good:
Lucia detected operational stress but responded with a generic capability redirect instead of containment.
Bad:
Weird.

Semantic confidence sliders

The five scoring dimensions use stepped 1–10 semantic sliders. The slider is not decoration. It is part of the evaluation interface. A low score should feel like concern. A middle score should feel mixed or uncertain. A high score should feel confident. This reduces the amount of mental translation required from reviewers. The app may show suggested 1–10 values before the reviewer chooses a score. A visible suggestion is not the saved score until the reviewer accepts or overrides the review and saves.

Human Guidance Evaluation

Eval Labs also captures 1–5 Human Guidance Evaluation scores:
emotionalValidation
cognitiveUnderstanding
actionability
toneAppropriateness
authenticity
notes
The Review Queue can show suggested 1–5 guidance scores. The displayed guidance state uses the mean score and treats any score of 2 or below as a hard-fail signal. Warmth and intelligence are not separate dimensions. In the current product, they are expressed through tone, calming, naturalness, trust, usefulness, cognitiveUnderstanding, actionability, and authenticity.

Quick Review fields

In addition to scoring dimensions, Eval Labs captures:
Did Lucia understand what was needed?
Did Lucia give the right next move?
Did Lucia make the situation feel calmer?
Did anything feel risky, confusing, or wrong?
Should a senior reviewer look at this?
Could this teach Lucia something reusable?
These fields are not replacements for senior adjudication. They are the employee signal layer. Suggested Quick Review selections are allowed. They should reduce reviewer burden, not replace reviewer judgment.