Eval Labs is only useful when reviewers are consistent, honest, specific, and operating inside their approved access scope.

Reviewer expectations

Reviewers should be:
  • honest
  • specific
  • consistent
  • grounded in the quality bar
  • willing to fail polished responses
  • careful with emotional signals
  • precise in notes

Do

  • write clear notes
  • identify patterns
  • flag truth issues
  • mark uncertainty
  • compare against the user’s actual need
  • save reviews before exporting reviewed evidence
  • use custom suites for targeted refinement
  • use auto-generated runs for broader regression checks only when your role allows it
  • keep AI-reviewed platform readiness separate from human Lucia-quality approval
  • keep derived diagnostic suggestions separate from saved Behavioral Observatory labels

Do not

  • pass weak responses to be nice
  • reward fancy wording
  • ignore tone failures
  • skip notes on borderline responses
  • treat one lucky response as proof
  • mix too many behavior families into one custom suite
  • confuse generated-only exports with reviewed exports
  • treat controlled batch results as human approval of Lucia quality
  • treat Registry Diagnostics suggestions as saved labels
  • treat Behavioral Observatory labels as global Lucia approval
  • use owner/admin surfaces from an evaluator role

Review notes

Good notes sound like:
"Correct operational priority, but Lucia missed the user's disorientation signal and did not provide containment."
Bad notes sound like:
"Seems fine."

Team standard

If another teammate cannot understand your review note, it is not specific enough.

When to escalate

Escalate a pattern when:
  • the same failure appears across 3+ related prompts
  • the failure affects trust
  • the failure affects distress handling
  • the failure causes wrong operational prioritization
  • the failure appears after a new deploy
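
Of the criteria above, only the first has a numeric threshold, so it is the only one that can be checked mechanically; the rest still need reviewer judgment. A minimal sketch, assuming reviews can be read as (prompt_id, failure_label) pairs and that every prompt passed in already belongs to one related group. The data shape and the names `ESCALATION_THRESHOLD` and `failures_to_escalate` are hypothetical and not part of Eval Labs:

```python
from collections import defaultdict

# Threshold taken from the guidance above: 3+ related prompts.
ESCALATION_THRESHOLD = 3


def failures_to_escalate(reviews):
    """Return failure labels seen on at least ESCALATION_THRESHOLD distinct prompts.

    `reviews` is a hypothetical iterable of (prompt_id, failure_label) pairs.
    A label of None means the response passed. All prompts are assumed to be
    related to each other; grouping by relatedness is left to the reviewer.
    """
    prompts_by_failure = defaultdict(set)
    for prompt_id, failure_label in reviews:
        if failure_label is not None:
            # A set counts each prompt once, even if it was reviewed twice.
            prompts_by_failure[failure_label].add(prompt_id)
    return sorted(
        label
        for label, prompts in prompts_by_failure.items()
        if len(prompts) >= ESCALATION_THRESHOLD
    )


reviews = [
    ("p1", "missed distress signal"),
    ("p2", "missed distress signal"),
    ("p3", "missed distress signal"),
    ("p3", "missed distress signal"),  # repeat review of the same prompt
    ("p4", "wording preference"),
    ("p5", None),
]
print(failures_to_escalate(reviews))  # ['missed distress signal']
```

Counting distinct prompts rather than raw reviews keeps a single prompt that was reviewed several times from looking like a pattern.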

What not to escalate

Do not escalate a single minor wording preference unless it represents a broader pattern. Eval Labs is for product signal, not personal taste fights.

Updated reviewer guidance

Employees should prioritize speed, honesty, and consistency.

Do:
  • use the guided controls
  • flag senior review when uncertain
  • write short notes only when they add context
  • mark reusable learning only when the pattern feels durable

Do not:
  • invent new labels
  • write long essays
  • create taxonomy language
  • treat personal taste as product signal
  • overthink every prompt

The goal is clean signal, not intellectual performance.

Access rule

Access is role-based by design.
  • Testers should use only Custom Prompt Test and Auto-generated Prompt Test.
  • Evaluators should use evaluator-safe test surfaces and their own run/review/history routes.
  • Owner/admin-only surfaces include Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, all-user analytics, cleanup/tools, and future admin/tools.

Do not onboard broader employee workflows until the Employee Onboarding Gate is satisfied. For the simple surface-by-surface path, read the Eval Labs Step-by-Step Operator Guide.
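
The access rule above can be read as a deny-by-default allowlist per role. The sketch below is illustrative only: the surface names come from the text, but the dictionary layout, the `is_allowed` function, and the assumption that owners and admins may also use the tester surfaces are hypothetical and do not describe how Eval Labs enforces access. Evaluator-safe surfaces are not enumerated in this section, so the evaluator role is deliberately left out of the mapping.

```python
# Surfaces named in the access rule. "future admin/tools" is omitted because
# it does not name a concrete surface yet.
TESTER_SURFACES = frozenset({
    "Custom Prompt Test",
    "Auto-generated Prompt Test",
})

OWNER_ADMIN_ONLY_SURFACES = frozenset({
    "Team Review",
    "Global Analysis",
    "Registry Diagnostics",
    "Behavioral Observatory",
    "all-user analytics",
    "cleanup/tools",
})

# Hypothetical role-to-surface mapping. Assumption: owners and admins can also
# reach the tester surfaces. The evaluator role is not listed because the text
# does not enumerate evaluator-safe surfaces.
ALLOWED_SURFACES = {
    "tester": TESTER_SURFACES,
    "owner": TESTER_SURFACES | OWNER_ADMIN_ONLY_SURFACES,
    "admin": TESTER_SURFACES | OWNER_ADMIN_ONLY_SURFACES,
}


def is_allowed(role, surface):
    """Return True only if the surface is explicitly listed for the role."""
    # Unknown roles and unlisted surfaces fall through to an empty set,
    # so the default answer is "not allowed".
    return surface in ALLOWED_SURFACES.get(role.lower(), frozenset())


print(is_allowed("tester", "Custom Prompt Test"))         # True
print(is_allowed("tester", "Team Review"))                # False
print(is_allowed("evaluator", "Behavioral Observatory"))  # False
```

Denying by default means a surface added later stays closed to every role until someone lists it explicitly, which matches the guidance not to use owner/admin surfaces from an evaluator role.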