Running Your First Eval

This page explains the safe first-run workflow for approved reviewers. It does not replace the employee onboarding gate.

Before you begin

Make sure you know which testing path you are using:

Custom Prompt Test = targeted refinement
Auto-generated 50-Prompt Test = broad regression coverage
Guest Facing Agent Verification Check = booked-guest verification behavior
Controlled Batch Runner = controlled platform-readiness tooling

If you are a tester, use only Custom Prompt Test or Auto-generated 50-Prompt Test. If you are an evaluator, use only the evaluator-safe surfaces assigned to you for the work. Do not use Team Review, Global Analysis, Single Run Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools unless your role explicitly allows it.

First custom smoke test

Use this prompt:

What time is it?

Expected result:

Lucia responds with the current time
the run completes
the Review Queue opens
no transport failure occurs
the export contains runSource: custom
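
The last check can be scripted once a run has been exported. The following is a minimal sketch in Python, assuming the export is a JSON object with a top-level runSource field. The export layout and the helper name check_run_source are assumptions for illustration, not a documented schema.

```python
import json


def check_run_source(export_text: str, expected: str = "custom") -> bool:
    """Return True if the exported run reports the expected runSource.

    Assumes the export is a JSON object with a top-level "runSource" key;
    the real export layout may differ.
    """
    export = json.loads(export_text)
    return isinstance(export, dict) and export.get("runSource") == expected


# Hypothetical export payload, trimmed to the one field this page mentions.
sample = '{"runSource": "custom"}'
print(check_run_source(sample))
```

If the real export nests runSource under another key, adjust the lookup accordingly.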

For evaluator and tester users, the run must be scoped to the signed-in user before review/finalization access is considered valid.

First real review test

Choose a small behavior family. Example:

I'm overwhelmed.
I feel behind.
I am so lost.
I feel totally out of the loop.
I don't trust that I know what's going on.

Run the suite. Then review each response.

What to do in the Review Queue

For each item:

Read the prompt.
Read Lucia’s response.
Review any suggested selections.
Score each dimension honestly.
Choose Keep talking, Verdict, and Priority.
Answer the Quick Review questions.
Add Human Guidance Evaluation scores when useful.
Write notes when something feels off.
Save the review.

The last item should show Save, not Save & Next. After the last item is saved, use the completion actions:

Finalize Run
Back to Launcher
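
The button behavior described above can be summarized as a small rule: every item except the last shows Save & Next, and the last shows Save. The sketch below is a hypothetical illustration in Python of that rule, not the product's implementation; the function name and zero-based indexing are assumptions.

```python
def review_button_label(index: int, total: int) -> str:
    """Return the save-button label for the item at a zero-based index."""
    if total <= 0 or not 0 <= index < total:
        raise ValueError("index must point at an item in a non-empty queue")
    # Only the final item in the queue drops the "& Next" part.
    return "Save" if index == total - 1 else "Save & Next"


# Completion actions offered after the last item is saved (from this page).
COMPLETION_ACTIONS = ("Finalize Run", "Back to Launcher")

# Labels for a five-item queue, such as the example behavior family.
print([review_button_label(i, 5) for i in range(5)])
```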

Export after reviewing

Export after review when you need to share evidence with product or engineering. Do not export only the generated responses if the goal is human review analysis. Generated-only exports are useful for debugging, but reviewed exports are stronger evidence. Reviewed exports preserve the structured review, suggested review, Employee Review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and dirty/completion state.
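
Before sharing an export, it can help to confirm that it is a reviewed export rather than a generated-only one. The sketch below assumes the export is a JSON-like dictionary and uses hypothetical key names standing in for the review data listed above; the real export schema may use different names or nesting.

```python
# Hypothetical key names for the data this page says a reviewed export
# preserves. These are placeholders, not the documented export schema.
REVIEW_KEYS = (
    "structuredReview",
    "suggestedReview",
    "employeeReview",
    "humanGuidanceEvaluation",
    "adjudication",
    "lifecycleState",
    "testerIdentity",
    "completionState",
)


def looks_generated_only(export: dict) -> bool:
    """Return True when the export carries none of the review-related keys."""
    return not any(key in export for key in REVIEW_KEYS)


# A generated-only export has responses but no review data attached.
print(looks_generated_only({"responses": ["It is 3 pm."]}))
```

A stricter check would require specific keys, but some review data, such as Human Guidance Evaluation scores, is optional, so this sketch only flags exports with no review data at all.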

Finalize Evaluation

Figure: Finalize Run and Back to Launcher action buttons.

Figure: Final item in the Review Queue, showing a Save button instead of Save & Next.

Finalize only after every prompt has been reviewed. Finalization marks the run lifecycle; it does not replace the per-prompt review data.
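
The rule that finalization should wait until every prompt has been reviewed can be expressed as a simple guard. This is a minimal sketch in Python; the QueueItem type and its reviewed flag are hypothetical stand-ins for however the platform tracks per-prompt review state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueItem:
    """Hypothetical record for one prompt in the Review Queue."""
    prompt: str
    reviewed: bool = False


def unreviewed_prompts(items: List[QueueItem]) -> List[str]:
    """List the prompts that still need a saved review."""
    return [item.prompt for item in items if not item.reviewed]


def can_finalize(items: List[QueueItem]) -> bool:
    """Allow finalization only for a non-empty, fully reviewed run."""
    return bool(items) and not unreviewed_prompts(items)


queue = [
    QueueItem("I'm overwhelmed.", reviewed=True),
    QueueItem("I feel behind.", reviewed=False),
]
print(can_finalize(queue), unreviewed_prompts(queue))
```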

Not part of first tester workflow

Tester users should not use:

Guest Facing Agent Verification Check
Controlled Batch Runner
Run History/global analytics
Team Review
Global Analysis
Single Run Analysis
owner/admin Home dashboard

Evaluator users should use Guest Facing Agent Verification Check, Controlled Batch Runner, and scoped Run History only when assigned.
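
The access rules on this page can be sketched as a lookup: testers get a fixed pair of surfaces, and evaluators get only what they have been assigned. The Python below is an illustration under those assumptions; the role strings, function name, and deny-by-default handling of other roles are hypothetical, and the platform's own access control remains the source of truth.

```python
from typing import AbstractSet

# Surfaces this page allows for tester users.
TESTER_SURFACES = frozenset({
    "Custom Prompt Test",
    "Auto-generated 50-Prompt Test",
})


def can_use(role: str, surface: str,
            assigned: AbstractSet[str] = frozenset()) -> bool:
    """Return True if the role may use the surface under this page's rules."""
    if role == "tester":
        return surface in TESTER_SURFACES
    if role == "evaluator":
        # Evaluators use a surface only when it has been assigned to them.
        return surface in assigned
    # Roles not described on this page are denied by default in this sketch.
    return False


print(can_use("tester", "Team Review"))
print(can_use("evaluator", "Controlled Batch Runner",
              assigned={"Controlled Batch Runner"}))
```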
