Eval Labs is only useful when reviewers are consistent, honest, specific, and operating inside their approved access scope.
Reviewer expectations
Reviewers should be:
- honest
- specific
- consistent
- grounded in the quality bar
- willing to fail polished responses
- careful with emotional signals
- precise in notes
Do
- write clear notes
- identify patterns
- flag truth issues
- mark uncertainty
- compare against the user’s actual need
- save reviews before exporting reviewed evidence
- use custom suites for targeted refinement
- use auto-generated runs for broader regression checks only when your role allows it
- keep AI-reviewed platform readiness separate from human Lucia-quality approval
- keep derived diagnostic suggestions separate from saved Behavioral Observatory labels
Do not
- pass weak responses to be nice
- reward fancy wording
- ignore tone failures
- skip notes on borderline responses
- treat one lucky response as proof
- mix too many behavior families into one custom suite
- confuse generated-only exports with reviewed exports
- treat controlled batch results as human approval of Lucia quality
- treat Registry Diagnostics suggestions as saved labels
- treat Behavioral Observatory labels as global Lucia approval
- use owner/admin surfaces from an evaluator role
Review notes
Good notes sound like:
Team standard
If another teammate cannot understand your review note, it is not specific enough.
When to escalate
Escalate a pattern when:
- the same failure appears across 3+ related prompts
- the failure affects trust
- the failure affects distress handling
- the failure causes wrong operational prioritization
- the failure appears after a new deploy
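The escalation criteria above can be sketched as a small check. This is only an illustration, not an Eval Labs feature: the `FailureNote` record, its field names, and `escalation_reasons` are hypothetical. It assumes that any single criterion is enough to escalate and that the notes passed in already belong to one group of related prompts.

```python
from dataclasses import dataclass

# Hypothetical record shape; the field names are illustrative assumptions,
# not an Eval Labs schema.
@dataclass(frozen=True)
class FailureNote:
    prompt_id: str
    failure_type: str  # e.g. "tone", "truth"
    affects_trust: bool = False
    affects_distress_handling: bool = False
    wrong_prioritization: bool = False
    after_new_deploy: bool = False


REPEAT_THRESHOLD = 3  # "3+ related prompts" from the guidance above


def escalation_reasons(notes):
    """Map each failure type to the reasons it qualifies for escalation.

    Assumes `notes` all come from one set of related prompts.
    Failure types with no qualifying reason are left out.
    """
    prompts = {}  # failure_type -> set of distinct prompt ids
    reasons = {}  # failure_type -> ordered list of reason labels
    for note in notes:
        prompts.setdefault(note.failure_type, set()).add(note.prompt_id)
        flagged = reasons.setdefault(note.failure_type, [])
        for label, hit in (
            ("affects trust", note.affects_trust),
            ("affects distress handling", note.affects_distress_handling),
            ("wrong operational prioritization", note.wrong_prioritization),
            ("appears after a new deploy", note.after_new_deploy),
        ):
            if hit and label not in flagged:
                flagged.append(label)
    for failure_type, ids in prompts.items():
        if len(ids) >= REPEAT_THRESHOLD:
            reasons[failure_type].insert(0, "repeated across 3+ prompts")
    return {ftype: why for ftype, why in reasons.items() if why}


notes = [
    FailureNote("p1", "tone"),
    FailureNote("p2", "tone"),
    FailureNote("p3", "tone"),
    FailureNote("p4", "wording"),  # single minor preference: not escalated
    FailureNote("p5", "truth", affects_trust=True),
]
result = escalation_reasons(notes)
```

In this sketch, the three `tone` failures escalate because they repeat, the `truth` failure escalates because it affects trust, and the lone `wording` note is dropped.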
What not to escalate
Do not escalate a single minor wording preference unless it represents a broader pattern. Eval Labs is for product signal, not personal taste fights.
Updated reviewer guidance
Employees should prioritize speed, honesty, and consistency.
Do
- use the guided controls
- flag senior review when uncertain
- write short notes only when they add context
- mark reusable learning only when the pattern feels durable
Do not
- invent new labels
- write long essays
- create taxonomy language
- treat personal taste as product signal
- overthink every prompt

