Eval Labs captures human judgment through guided structure so the output is useful for product, engineering, and future model refinement.
Core idea
Eval Labs does not collect opinions for their own sake.
It collects structured human judgment.
The goal is to preserve the human signal while removing avoidable noise.
Freeform review creates problems:
- reviewer language drift
- inconsistent labels
- hard-to-compare exports
- ambiguous training signal
- slow onboarding
- employee hesitation
Freeform notes are still useful, but they should support structured review. They should not be the primary data layer.
Preferred structure
Eval Labs should collect:
- ratings
- guided quick-review answers
- human guidance scores
- escalation state
- optional short notes
- adjudication metadata when needed
- Behavioral Observatory labels when assigned
- exports that preserve all layers
The current Human Guidance Evaluation dimensions are:
- emotionalValidation
- cognitiveUnderstanding
- actionability
- toneAppropriateness
- authenticity
- notes
Warmth, intelligence, and emotional quality should map into these existing fields.
They should not become separate magical categories.
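One way to keep extra categories out of the data is to check each submitted evaluation against the fixed field list. The sketch below is illustrative only: the field names come from the list above, while the function name `unknown_fields` and the dictionary payload shape are assumptions, not the app's actual code.

```python
# Field names taken from the Human Guidance Evaluation list above.
ALLOWED_FIELDS = (
    "emotionalValidation",
    "cognitiveUnderstanding",
    "actionability",
    "toneAppropriateness",
    "authenticity",
    "notes",
)


def unknown_fields(payload):
    """Return any keys in a submitted evaluation that are not existing fields."""
    return sorted(key for key in payload if key not in ALLOWED_FIELDS)


# Hypothetical payload: "warmth" was invented by a reviewer or a client build.
payload = {"emotionalValidation": 4, "warmth": 5, "notes": "Kind but vague."}
print(unknown_fields(payload))  # ['warmth']
```

A check like this does not decide where warmth belongs. It only flags that the signal arrived outside the existing fields, so it can be mapped into them instead of becoming a new category.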
Why this improves data quality
Structured human judgment makes it easier to compare:
- model versions
- prompt suites
- behavior families
- reviewer patterns
- failure clusters
- canon candidates
It also helps prevent employee reviewers from accidentally training Lucia with inconsistent language.
UX principle
The better the review UX, the better the data.
If reviewers hesitate, overthink, or invent categories, the signal gets worse.
The interface should make the right review behavior feel obvious.
The app may show suggested human guidance scores before the reviewer saves.
Those suggestions can reduce friction, but the reviewer still owns the final score.
Human Guidance Evaluation also produces a mean score and can surface hard-fail behavior when any scored dimension is very weak.
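The suggestion, mean score, and hard-fail behavior can be sketched as follows. This is a minimal illustration under stated assumptions: scores are taken to be integers from 1 to 5, a score of 1 is taken to mean "very weak", and `notes` is treated as unscored free text. The real scale, threshold, and function names may differ.

```python
from statistics import mean

# Scored dimensions from the Human Guidance Evaluation list.
# "notes" is treated here as free text and is not scored (assumption).
SCORED_DIMENSIONS = (
    "emotionalValidation",
    "cognitiveUnderstanding",
    "actionability",
    "toneAppropriateness",
    "authenticity",
)

# Assumption: 1-5 integer scale, and a score of 1 counts as "very weak".
HARD_FAIL_THRESHOLD = 1


def resolve_scores(suggested, reviewer):
    """Start from suggested scores; any score the reviewer entered wins."""
    return {dim: reviewer.get(dim, suggested[dim]) for dim in SCORED_DIMENSIONS}


def summarize(scores, hard_fail_threshold=HARD_FAIL_THRESHOLD):
    """Compute the mean score and flag a hard fail if any dimension is very weak."""
    weak = [dim for dim in SCORED_DIMENSIONS if scores[dim] <= hard_fail_threshold]
    return {
        "meanScore": mean(scores[dim] for dim in SCORED_DIMENSIONS),
        "hardFail": bool(weak),
        "weakDimensions": weak,
    }


suggested = {dim: 4 for dim in SCORED_DIMENSIONS}
reviewer = {"authenticity": 1}  # the reviewer overrides one suggestion before saving
final = resolve_scores(suggested, reviewer)
print(summarize(final))
```

In this sketch the mean alone would hide the problem, because four strong scores pull it up. The hard-fail flag is what keeps a single very weak dimension visible.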
Behavioral Observatory structure
Behavioral Observatory adds a focused structured-label layer:
- Intent
- Guest Affect
- Response Strategy
- Humanness
- Notes
These fields record Lucia-specific behavioral evidence: what the human needed (Intent), how the human felt (Guest Affect), what Lucia did (Response Strategy), how human the response felt (Humanness), and what evidence should be preserved (Notes).
Derived suggestions can prefill context, but only saved Behavioral Observatory labels count as persisted label data.
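The split between derived suggestions and saved labels can be sketched as two separate slots on the same record. The class names, camelCase attribute names, and example label values below are hypothetical; only the five fields and the rule that saved labels alone count come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObservatoryLabels:
    # One attribute per Behavioral Observatory field (attribute names assumed).
    intent: str
    guestAffect: str
    responseStrategy: str
    humanness: str
    notes: str = ""


@dataclass
class ObservatoryState:
    suggested: Optional[ObservatoryLabels] = None  # derived, used for prefill only
    saved: Optional[ObservatoryLabels] = None      # persisted label data

    def form_defaults(self):
        """What the form shows: saved labels if present, else the suggestion."""
        return self.saved or self.suggested

    def persisted(self):
        """What counts as label data: saved labels only, never suggestions."""
        return self.saved


# Hypothetical label values, for illustration only.
state = ObservatoryState(
    suggested=ObservatoryLabels("seek_reassurance", "anxious", "validate_then_guide", "high")
)
print(state.persisted())             # None: nothing has been saved yet
state.saved = state.form_defaults()  # the reviewer accepts the prefill and saves
print(state.persisted() is not None) # True
```

Keeping the two slots separate means an export can never mistake an unreviewed suggestion for a human label.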
Export principle
Exports should preserve both:
- simple employee signal
- senior adjudication signal
This lets analysis separate fast human reaction from final canonical meaning.
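A minimal sketch of an export row that keeps both layers side by side is shown below. The key names (`itemId`, `employeeReview`, `adjudication`) and the example values are assumptions for illustration, not the actual export schema.

```python
import json


def build_export_row(item_id, employee_review, adjudication=None):
    """Keep the employee layer and the adjudication layer as separate keys.

    The adjudication never overwrites the employee review. When no senior
    adjudication exists, the key is kept with a null value so rows stay comparable.
    """
    return {
        "itemId": item_id,
        "employeeReview": dict(employee_review),
        "adjudication": dict(adjudication) if adjudication is not None else None,
    }


# Hypothetical values: the employee's fast reaction and the senior final call differ.
row = build_export_row(
    "item-001",
    employee_review={"rating": 2, "notes": "Felt cold."},
    adjudication={"rating": 4, "notes": "Tone is appropriate for the request."},
)
print(json.dumps(row, indent=2))
```

Because both ratings survive in the row, analysis can measure how often the fast reaction and the final canonical meaning disagree, rather than only seeing the final value.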