Structured Human Judgment Capture

Eval Labs captures human judgment through guided structure so the output is useful for product, engineering, and future model refinement.

Core idea

Eval Labs does not collect opinions for their own sake. It collects structured human judgment. The goal is to preserve the human signal while removing avoidable noise.

Why freeform-only review fails

Relying on freeform review alone creates problems:

reviewer language drift
inconsistent labels
hard-to-compare exports
ambiguous training signal
slow onboarding
employee hesitation

Freeform notes are still useful, but they should support structured review. They should not be the primary data layer.

Preferred structure

Eval Labs should collect:

ratings
guided quick-review answers
human guidance scores
escalation state
optional short notes
adjudication metadata when needed
Behavioral Observatory labels when assigned
exports that preserve all layers

The current Human Guidance Evaluation fields are:

emotionalValidation
cognitiveUnderstanding
actionability
toneAppropriateness
authenticity
notes

Warmth, intelligence, and emotional quality should map into these existing fields. They should not become separate magical categories.
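The fields above could be typed roughly as follows. This is a minimal sketch, not the actual Eval Labs schema: the field names come from the list, but the 1-to-5 scale, the type names, and the `invalidDimensions` helper are assumptions made for illustration.

```typescript
// Assumption: each scored dimension uses an integer 1-5 scale.
type Score = 1 | 2 | 3 | 4 | 5;

// Dimension names taken from the Human Guidance Evaluation list.
const SCORED_DIMENSIONS = [
  "emotionalValidation",
  "cognitiveUnderstanding",
  "actionability",
  "toneAppropriateness",
  "authenticity",
] as const;

type ScoredDimension = (typeof SCORED_DIMENSIONS)[number];

// Hypothetical record shape: five scores plus an optional short note.
type HumanGuidanceScores = Record<ScoredDimension, Score> & {
  notes?: string;
};

// Returns the dimensions whose value is missing or outside the assumed scale.
function invalidDimensions(
  input: Partial<Record<ScoredDimension, unknown>>
): ScoredDimension[] {
  return SCORED_DIMENSIONS.filter((dim) => {
    const value = input[dim];
    const valid =
      typeof value === "number" &&
      Number.isInteger(value) &&
      value >= 1 &&
      value <= 5;
    return !valid;
  });
}

// Example record: warmth is expressed through the existing fields,
// not through a new category.
const example: HumanGuidanceScores = {
  emotionalValidation: 5,
  cognitiveUnderstanding: 4,
  actionability: 3,
  toneAppropriateness: 5,
  authenticity: 4,
  notes: "Warm and specific, light on next steps.",
};
const problems = invalidDimensions(example);
```

Keeping the dimension list in one constant means a reviewer-invented category has nowhere to land, which is the point of mapping warmth and similar qualities into the existing fields.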

Why this improves data quality

Structured human judgment makes it easier to compare:

model versions
prompt suites
behavior families
reviewer patterns
failure clusters
canon candidates

It also helps prevent employee reviewers from accidentally training Lucia with inconsistent language.

UX principle

The better the review UX, the better the data. If reviewers hesitate, overthink, or invent categories, the signal gets worse. The interface should make the right review behavior feel obvious.

The app may show suggested human guidance scores before the reviewer saves. Those suggestions can reduce friction, but the reviewer still owns the final score. Human Guidance Evaluation also produces a mean score and can surface hard-fail behavior when any scored dimension is very weak.
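The mean score and hard-fail behavior can be sketched as below. The threshold for "very weak" is not defined here, so this example assumes a 1-to-5 scale with a hard fail at a score of 1 or lower. `summarizeGuidance` and `HARD_FAIL_AT_OR_BELOW` are hypothetical names, not part of the app.

```typescript
// Dimension names taken from the Human Guidance Evaluation list.
const DIMENSIONS = [
  "emotionalValidation",
  "cognitiveUnderstanding",
  "actionability",
  "toneAppropriateness",
  "authenticity",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type DimensionScores = Record<Dimension, number>;

// Assumption: "very weak" means a score at or below 1 on a 1-5 scale.
const HARD_FAIL_AT_OR_BELOW = 1;

interface GuidanceSummary {
  mean: number;
  hardFail: boolean;
  weakDimensions: Dimension[];
}

// Computes the mean across scored dimensions and flags a hard fail
// when any single dimension falls at or below the threshold.
function summarizeGuidance(scores: DimensionScores): GuidanceSummary {
  const total = DIMENSIONS.reduce((sum, dim) => sum + scores[dim], 0);
  const weakDimensions = DIMENSIONS.filter(
    (dim) => scores[dim] <= HARD_FAIL_AT_OR_BELOW
  );
  return {
    mean: total / DIMENSIONS.length,
    hardFail: weakDimensions.length > 0,
    weakDimensions,
  };
}
```

In this sketch a response can have a healthy mean and still hard-fail, because one very weak dimension is reported separately instead of being averaged away.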

Behavioral Observatory structure

Behavioral Observatory adds a focused structured-label layer:

Intent
Guest Affect
Response Strategy
Humanness
Notes

These fields record Lucia-specific behavioral evidence: what the human needed (Intent), how the human felt (Guest Affect), what Lucia did (Response Strategy), how human the response felt (Humanness), and what evidence should be preserved (Notes). Derived suggestions can prefill context, but only saved Behavioral Observatory labels count as persisted label data.
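One way to keep derived suggestions from being mistaken for persisted labels is to tag each record with its status. The sketch below assumes plain string values for every field and uses hypothetical names (`ObservatoryLabels`, `SuggestedLabels`, `SavedLabels`, `persistedLabels`); it is not the actual storage model.

```typescript
// Field names follow the Behavioral Observatory list; string values are assumed.
interface ObservatoryLabels {
  intent: string;
  guestAffect: string;
  responseStrategy: string;
  humanness: string;
  notes?: string;
}

// A derived suggestion may be incomplete and is never persisted label data.
interface SuggestedLabels {
  status: "suggested";
  labels: Partial<ObservatoryLabels>;
}

// A saved label set is complete and attributed to a reviewer.
interface SavedLabels {
  status: "saved";
  labels: ObservatoryLabels;
  savedBy: string;
}

type LabelRecord = SuggestedLabels | SavedLabels;

// Keeps only records a reviewer explicitly saved.
function persistedLabels(records: LabelRecord[]): SavedLabels[] {
  return records.filter(
    (record): record is SavedLabels => record.status === "saved"
  );
}
```

Because the two shapes are distinct types, downstream analysis in this sketch cannot read a prefilled suggestion as if a human had confirmed it.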

Export principle

Exports should preserve both:

simple employee signal
senior adjudication signal

This lets analysis separate fast human reaction from final canonical meaning.
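An export row that keeps both layers side by side might look like the sketch below. All type, field, and function names here are hypothetical, and a single numeric rating stands in for whatever the real export carries.

```typescript
// Hypothetical fast employee signal: one row per reviewer.
interface EmployeeSignal {
  reviewerId: string;
  rating: number;
  notes?: string;
}

// Hypothetical senior adjudication signal: present only when adjudicated.
interface AdjudicationSignal {
  adjudicatorId: string;
  finalRating: number;
  rationale?: string;
}

// Both layers are stored; neither overwrites the other.
interface ExportRow {
  itemId: string;
  employeeSignals: EmployeeSignal[];
  adjudication: AdjudicationSignal | null;
}

// Fast human reaction: mean of employee ratings, or null if there are none.
function fastReaction(row: ExportRow): number | null {
  if (row.employeeSignals.length === 0) return null;
  const total = row.employeeSignals.reduce((sum, s) => sum + s.rating, 0);
  return total / row.employeeSignals.length;
}

// Final canonical meaning: the adjudicated rating, or null if not adjudicated.
function canonicalRating(row: ExportRow): number | null {
  return row.adjudication ? row.adjudication.finalRating : null;
}
```

Reading the two values from separate fields lets analysis compare how far first reactions sit from the adjudicated outcome without losing either one.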
