Review is where Eval Labs becomes useful. The reviewer’s job is to judge behavior honestly, not politely. AI-reviewed platform evidence does not replace this human judgment.

Review order

Use this order:
  1. intent
  2. truth
  3. usefulness
  4. clarity
  5. tone
  6. next move
  7. trust aftertaste

1. Intent

Did Lucia understand what the user was asking? If intent is wrong, the response usually fails. For example, if the user says:
I feel totally out of the loop.
Lucia should not respond with a generic capability menu. That is likely an intent-layer miss.

2. Truth

Did Lucia claim anything she could not know or verify? Truth failures are serious. Examples:
  • saying a vendor was contacted when no dispatch happened
  • saying an issue is resolved when only a suggestion was made
  • implying full confidence when the signal is inferred

3. Usefulness

Did the response help the user move forward? A response can be warm and still useless.

4. Clarity

Was the response easy to understand without extra work? Lucia should not make the operator scan five paragraphs to find the first move.

5. Tone

Was the tone appropriate for the moment? For Lucia, tone should be:
  • warm
  • calm
  • specific
  • not robotic
  • not therapy-bot

6. Next move

Did Lucia give the right next move when a next move was needed? Not every prompt requires a task. But distress and ops prompts usually require narrowing.

7. Trust aftertaste

After reading the response, ask:
Do I trust Lucia more, less, or the same?
If the answer is less, write down why.
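
The trust check above can be treated as a small record with one rule: a "less" answer needs a written reason. The sketch below is only an illustration. `TrustChange` and `TrustAftertaste` are hypothetical names, not Eval Labs fields.

```python
from dataclasses import dataclass
from enum import Enum


class TrustChange(Enum):
    """The three possible answers to the trust question (hypothetical names)."""
    MORE = "more"
    SAME = "same"
    LESS = "less"


@dataclass
class TrustAftertaste:
    change: TrustChange
    reason: str = ""

    def __post_init__(self) -> None:
        # The guide asks the reviewer to write down why when trust went down.
        if self.change is TrustChange.LESS and not self.reason.strip():
            raise ValueError("trust decreased: write down why")
```

Rejecting an empty reason at construction time is one design choice among several. A softer version could accept the record and surface a warning instead.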

Saving reviews

Use:
  • Save & Next for non-final prompts
  • Save for the final prompt
  • Finalize Run when the run review is complete
Finalization marks the run lifecycle. It does not replace per-prompt review data.
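
As a rough illustration of the save rules above, the sketch below picks the button for a prompt from its position in the run. The function names and the 0-based index are assumptions made for the example. It also assumes that "the run review is complete" means every prompt in the run has a saved review, which the guide does not state explicitly.

```python
from typing import List, Set


def save_action(prompt_index: int, prompt_count: int) -> str:
    """Return the button label for the prompt at a 0-based position in a run."""
    if not 0 <= prompt_index < prompt_count:
        raise IndexError("prompt_index is outside the run")
    is_final = prompt_index == prompt_count - 1
    return "Save" if is_final else "Save & Next"


def ready_to_finalize(saved_prompt_ids: Set[str], run_prompt_ids: List[str]) -> bool:
    """Assumption: a run can be finalized once every prompt has a saved review."""
    if not run_prompt_ids:
        return False
    return all(pid in saved_prompt_ids for pid in run_prompt_ids)
```

Keeping `ready_to_finalize` separate from `save_action` mirrors the point that finalization is a run-level step and does not stand in for per-prompt review data.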

Reviewer discipline

Do not pass a response just because it sounds smart. Pass it only if it works.
AI-reviewed readiness runs can prove the platform captured and persisted reviews. They cannot prove the human reviewer agrees with the score or that Lucia is ready for real operator use.

Review Queue vs Behavioral Observatory

Review Queue and Behavioral Observatory are related, but they are not the same workflow. Review Queue is where the reviewer scores and reviews the prompt/response item. Behavioral Observatory is where a reviewer can save structured behavioral labels for a conversation:
  • Intent
  • Guest Affect
  • Response Strategy
  • Humanness
  • Notes
Registry Diagnostics is separate again. It shows derived dataset and queue-lane suggestions, not saved human labels.
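
One way to picture the saved Behavioral Observatory labels is as a single record per conversation. The sketch below is hypothetical: the class name, the field names, and the use of free-text strings are assumptions, and the real platform may use fixed option lists or a different shape entirely.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BehavioralLabels:
    """Human-saved labels for one conversation (hypothetical shape)."""
    intent: str
    guest_affect: str
    response_strategy: str
    humanness: str
    notes: str = ""  # optional free-text note


# Example values are invented for illustration only.
labels = BehavioralLabels(
    intent="status check",
    guest_affect="anxious",
    response_strategy="narrow to one next step",
    humanness="natural",
)
saved_payload = asdict(labels)
```

The record holds only what a human reviewer saved. Derived suggestions from Registry Diagnostics would live elsewhere and are deliberately not part of it.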

Updated Review Queue flow

Use this practical flow:
  1. read the prompt
  2. read Lucia’s response
  3. review any suggested selections
  4. score the five dimensions with the semantic confidence sliders
  5. answer Quick Review questions
  6. add Human Guidance Evaluation scores when useful
  7. add a short note only if needed
  8. flag senior review when uncertain or concerned
  9. mark reusable learning only when the case teaches a durable lesson
  10. save and move on
If the assignment includes Behavioral Observatory, use the saved-label workflow after reading the conversation carefully. Do not copy derived suggestions blindly.

Quick Review rule

Quick Review is not a test of the reviewer’s AI knowledge. It is a structured way to capture whether Lucia worked for a human. If you are unsure, use the senior review option instead of inventing your own taxonomy.

Escalation rule

Escalate when:
  • Lucia may have overclaimed
  • the response creates risk or confusion
  • intent is unclear
  • the case involves owner stress, money, maintenance, guest trust, or safety
  • the response contains a reusable pattern
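
The escalation triggers above can be sketched as a checklist that returns every reason that applies. This is a minimal illustration, not platform logic: `ReviewSignals`, its fields, and `escalation_reasons` are hypothetical names, and the sensitive-topic strings are copied from the list above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Topics the escalation rule names explicitly.
SENSITIVE_TOPICS: FrozenSet[str] = frozenset(
    {"owner stress", "money", "maintenance", "guest trust", "safety"}
)


@dataclass(frozen=True)
class ReviewSignals:
    """Reviewer observations for one item (hypothetical field names)."""
    possible_overclaim: bool = False
    creates_risk_or_confusion: bool = False
    intent_unclear: bool = False
    reusable_pattern: bool = False
    topics: FrozenSet[str] = frozenset()


def escalation_reasons(signals: ReviewSignals) -> List[str]:
    """Return every escalation trigger that applies; an empty list means none."""
    reasons: List[str] = []
    if signals.possible_overclaim:
        reasons.append("possible overclaim")
    if signals.creates_risk_or_confusion:
        reasons.append("risk or confusion")
    if signals.intent_unclear:
        reasons.append("intent unclear")
    # Sorted so the output order is stable regardless of set ordering.
    for topic in sorted(signals.topics & SENSITIVE_TOPICS):
        reasons.append(f"sensitive topic: {topic}")
    if signals.reusable_pattern:
        reasons.append("reusable pattern")
    return reasons
```

Returning the reasons, not just a yes or no, keeps the "why" attached to the escalation, so a senior reviewer can see what prompted it.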