Eval Labs Review Layer Release Notes

This page records the May 2026 review-layer evolution: shared run launchers, employee review, suggested selections, Human Guidance Evaluation, adjudication-ready schema, exports, queue filters, lifecycle finalization, and the later platform-readiness split between normal testing and controlled batch gates.

May 2026 review-layer milestone

Eval Labs evolved from a prompt runner into a layered review product. The key change:

custom or automated run → shared Review Queue → suggested selections plus human review → lifecycle finalization and export

Major shipped changes

Adjudication-ready schema

Added review model support for:

reviewState
luciaPredictedLabels
humanLabels
adjudication
employeeReview
suggestedEmployeeReview
humanGuidanceEval
suggestedHumanGuidanceEval
canonCandidate
reviewLifecycle

These fields are preserved through local storage, Supabase payload persistence, dirty-state detection, and exports.
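As a rough illustration of what "preserved through dirty-state detection" could mean, the sketch below lists the field names above and compares them between a saved and a current record. Only the field names come from this page. The `ReviewRecord` type, the `isDirty` helper, and the use of `unknown` for every field are assumptions, not the app's actual model.

```typescript
// Field names are taken from the release notes; all types are assumptions.
const REVIEW_FIELDS = [
  "reviewState",
  "luciaPredictedLabels",
  "humanLabels",
  "adjudication",
  "employeeReview",
  "suggestedEmployeeReview",
  "humanGuidanceEval",
  "suggestedHumanGuidanceEval",
  "canonCandidate",
  "reviewLifecycle",
] as const;

type ReviewField = (typeof REVIEW_FIELDS)[number];

// Hypothetical record shape: every review field is optional and untyped here.
type ReviewRecord = Partial<Record<ReviewField, unknown>>;

// Hypothetical dirty check: the record is dirty when any review field
// serializes differently from the last saved copy.
function isDirty(saved: ReviewRecord, current: ReviewRecord): boolean {
  return REVIEW_FIELDS.some(
    (field) => JSON.stringify(saved[field]) !== JSON.stringify(current[field])
  );
}
```

Comparing serialized values is sensitive to object key order, so a real implementation would likely normalize or deep-compare instead.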

Employee Review layer

Added guided employee-review fields:

understoodNeed
rightNextMove
calmingEffect
riskOrConfusion
seniorReview
reusableLearning

These replace freeform taxonomy collection for non-expert reviewers. The app can also suggest Employee Review answers from prompt/response heuristics. Each suggestion is shown as a suggested signal; the reviewer still saves the human review.

Suggested review layer

Added app-suggested review values for:

1-10 ratings
keepTalking
pass / refine / fail
priority
Employee Review answers
1-5 Human Guidance Evaluation scores

These suggestions come from prompt text, Lucia response text, run status, run errors, and simple response-quality heuristics such as clear next move, calming language, list-heavy output, robotic language, fake empathy, and overclaiming. They are not canonical truth.
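A minimal sketch of how a heuristic suggestion of this kind might be produced. The signal names mirror the ones listed above, but the regular expressions, the 50% list threshold, and the `suggestVerdict` function itself are invented for illustration and are not the app's actual rules.

```typescript
type Verdict = "pass" | "refine" | "fail";

interface Suggestion {
  verdict: Verdict;
  signals: string[];
  // Always "suggested": this value is never treated as the human review.
  source: "suggested";
}

// Hypothetical heuristic: patterns and thresholds are illustrative assumptions.
function suggestVerdict(response: string, runError?: string): Suggestion {
  if (runError) {
    return { verdict: "fail", signals: ["run error"], source: "suggested" };
  }

  const signals: string[] = [];
  const lines = response.split("\n").filter((line) => line.trim() !== "");
  const bulletLines = lines.filter((line) => /^\s*([-*•]|\d+\.)\s/.test(line)).length;

  if (lines.length > 0 && bulletLines / lines.length > 0.5) {
    signals.push("list-heavy output");
  }
  if (/\b(guarantee|definitely will|100%)/i.test(response)) {
    signals.push("overclaiming");
  }
  if (/\b(next step|you can start by|try this)\b/i.test(response)) {
    signals.push("clear next move");
  }

  const hasNextMove = signals.includes("clear next move");
  const concerns = signals.filter((signal) => signal !== "clear next move").length;
  const verdict: Verdict = hasNextMove && concerns === 0 ? "pass" : "refine";
  return { verdict, signals, source: "suggested" };
}
```

The fixed `source` field is one way to keep a suggestion from being mistaken for canonical truth further down the pipeline.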

Review Queue UX

The Review Queue now favors guided employee judgment:

single-column Quick Review flow
numbered question cards
separate selection boxes
suggested selections
reduced freeform text burden
senior-review routing
canon-candidate routing
Human Guidance Evaluation
Save / Save & Next / Save & next flagged flows
search and workflow filters
JSON, CSV, and Markdown export controls
finalization after all prompts are reviewed

Semantic confidence sliders

The “How did Lucia do?” scoring section moved from 1–10 button rows to stepped semantic confidence sliders. The final design direction:

low score → muted concern
mid score → soft uncertainty
high score → restrained confidence

The sliders should feel like native OS controls: calm, premium, tactile, and low-friction.
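The score-to-tone mapping above could be expressed as a small pure function. This is a sketch only: it assumes the sliders keep the earlier 1-10 range, and the band boundaries (1-3, 4-7, 8-10) and the `toneForScore` name are assumptions, since this page does not state where the bands change.

```typescript
type Tone = "muted concern" | "soft uncertainty" | "restrained confidence";

// Hypothetical band boundaries; the release notes only name the three tones.
function toneForScore(score: number): Tone {
  if (!Number.isInteger(score) || score < 1 || score > 10) {
    throw new RangeError(`score must be an integer from 1 to 10, got ${score}`);
  }
  if (score <= 3) return "muted concern";
  if (score <= 7) return "soft uncertainty";
  return "restrained confidence";
}
```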

Adjudication queue filters

Added workflow queue filters for:

Needs final call
Canon candidates

This lets senior review focus on the cases that matter most. This release supports adjudication routing, metadata, and exports. It does not depend on a separate senior-adjudication editing screen.
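One way to picture the two filters is as predicates over queue items. The filter labels come from this page. The `QueueItem` shape and the rule that "Needs final call" means senior review was requested and no final adjudication exists yet are assumptions made for the sketch.

```typescript
// Hypothetical queue item; the boolean flags are illustrative assumptions.
interface QueueItem {
  id: string;
  seniorReviewRequested: boolean;
  adjudicationFinal: boolean;
  canonCandidate: boolean;
}

type QueueFilter = "Needs final call" | "Canon candidates";

function applyQueueFilter(items: QueueItem[], filter: QueueFilter): QueueItem[] {
  switch (filter) {
    case "Needs final call":
      // Assumed rule: flagged for senior review and not yet adjudicated.
      return items.filter((item) => item.seniorReviewRequested && !item.adjudicationFinal);
    case "Canon candidates":
      return items.filter((item) => item.canonCandidate);
  }
}
```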

Exports

JSON, CSV, and Markdown exports now preserve structured review, suggested review, Employee Review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and prompt dirty/completion state.
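For the CSV path, the main risk to preserved fields is unescaped commas, quotes, or line breaks in reviewer text. The sketch below shows standard CSV quoting over flat rows. The `ExportRow` type, the `toCsv` helper, and the column names in the usage note are hypothetical and do not describe the app's real export columns.

```typescript
// Hypothetical flat export row; real exports may nest or name fields differently.
type ExportRow = Record<string, string | number | boolean | null>;

// Quote a cell only when it contains a comma, quote, or line break,
// doubling any embedded quotes.
function csvCell(value: string | number | boolean | null): string {
  const text = value === null ? "" : String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(rows: ExportRow[], columns: string[]): string {
  const header = columns.map(csvCell).join(",");
  const body = rows.map((row) =>
    columns.map((column) => csvCell(row[column] ?? null)).join(",")
  );
  return [header, ...body].join("\n");
}
```

For example, `toCsv(rows, ["promptId", "reviewState", "testerId"])` would emit one header line followed by one line per row, with missing values left empty.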

Supabase persistence

Supabase persistence now stores run lifecycle metadata on eval_runs, embeds the full case and prompt review record in eval_run_items.payload, and writes eval_item_reviews rows for review persistence. Hydration prefers the embedded eval_run_items.payload.promptRecord over fragile review-table reads.
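The hydration preference can be sketched as a pure function that takes rows already fetched from the two tables. The table names and the `payload.promptRecord` path come from this page. The `PromptRecord` fields, the review-row column names, and `hydratePromptRecord` are assumptions, and no Supabase client calls are shown.

```typescript
// Hypothetical shapes; only payload.promptRecord is named in the release notes.
interface PromptRecord {
  promptId: string;
  reviewState?: string;
}

interface RunItemRow {
  payload?: { promptRecord?: PromptRecord } | null;
}

// Assumed columns for a row read from eval_item_reviews.
interface ReviewRow {
  prompt_id: string;
  review_state?: string;
}

function hydratePromptRecord(
  item: RunItemRow,
  review?: ReviewRow
): PromptRecord | undefined {
  // Prefer the record embedded in eval_run_items.payload.
  const embedded = item.payload?.promptRecord;
  if (embedded) return embedded;

  // Fall back to the review table only when nothing is embedded.
  if (review) {
    return { promptId: review.prompt_id, reviewState: review.review_state };
  }
  return undefined;
}
```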

Current doctrine impact

This release established a new Eval Labs principle:

The app may suggest.
The reviewer must decide.
Senior meaning stays separate from employee signal.

This should be protected in future product work.

Product surface refinement

After the review-layer release, Eval Labs was refined into a clearer product surface:

top app shell owns page identity
in-page blog-style mastheads were removed from the app
Custom Prompt Test, Auto-generated Prompt Test, and Controlled Batch Runner are separate surfaces
Controlled Batch Runner is controlled readiness tooling; current access is owner/admin/evaluator, not tester
Auto-generated Prompt Test remains the normal 50-prompt generated tester
Run History rows use a standardized two-zone layout
copy controls use Copy Session ID / Copy Deep Link patterns across key surfaces
Single Run Analysis gives read-only run-level evidence outside the Review Queue

Readiness doctrine added

The AI-reviewed platform readiness gate passed after 60 completed runs and 3,000 reviewed prompts. This extends the review-layer doctrine:

The app may suggest.
The reviewer must decide.
AI-reviewed platform readiness is not human Lucia-quality approval.

Protect this distinction in future release notes and onboarding language.
