This page records the May 2026 review-layer evolution: shared run launchers, employee review, suggested selections, Human Guidance Evaluation, adjudication-ready schema, exports, queue filters, lifecycle finalization, and the later platform-readiness split between normal testing and controlled batch gates.

May 2026 review-layer milestone

Eval Labs evolved from a prompt runner into a layered review product. The key change:
custom or automated run → shared Review Queue → suggested selections plus human review → lifecycle finalization and export

Major shipped changes

Adjudication-ready schema

Added review model support for:
reviewState
luciaPredictedLabels
humanLabels
adjudication
employeeReview
suggestedEmployeeReview
humanGuidanceEval
suggestedHumanGuidanceEval
canonCandidate
reviewLifecycle
These fields are preserved through local storage, Supabase payload persistence, dirty-state detection, and exports.
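
As a rough illustration of how dirty-state detection over these fields could work, the sketch below compares a saved record against the current one. The field names come from the list above; the value types, the REVIEW_FIELDS constant, and the changedReviewFields and isDirty helpers are assumptions made for this example, not the app's actual code.

```typescript
// Hypothetical record shape: the field names are from the release notes,
// but every value type here is an assumption for illustration only.
interface PromptReviewRecord {
  reviewState?: string;
  luciaPredictedLabels?: string[];
  humanLabels?: string[];
  adjudication?: Record<string, unknown>;
  employeeReview?: Record<string, unknown>;
  suggestedEmployeeReview?: Record<string, unknown>;
  humanGuidanceEval?: Record<string, unknown>;
  suggestedHumanGuidanceEval?: Record<string, unknown>;
  canonCandidate?: boolean;
  reviewLifecycle?: string;
}

const REVIEW_FIELDS: (keyof PromptReviewRecord)[] = [
  "reviewState",
  "luciaPredictedLabels",
  "humanLabels",
  "adjudication",
  "employeeReview",
  "suggestedEmployeeReview",
  "humanGuidanceEval",
  "suggestedHumanGuidanceEval",
  "canonCandidate",
  "reviewLifecycle",
];

// Returns the review fields whose serialized value differs between the
// last saved record and the current one. JSON.stringify is sensitive to
// object key order, so this is a simplification, not a deep-equality check.
function changedReviewFields(
  saved: PromptReviewRecord,
  current: PromptReviewRecord,
): (keyof PromptReviewRecord)[] {
  return REVIEW_FIELDS.filter(
    (field) => JSON.stringify(saved[field]) !== JSON.stringify(current[field]),
  );
}

// A record counts as dirty when at least one review field has changed.
function isDirty(saved: PromptReviewRecord, current: PromptReviewRecord): boolean {
  return changedReviewFields(saved, current).length > 0;
}
```

Listing the fields in one place means a newly added review field only has to be registered once to take part in the comparison.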

Employee Review layer

Added guided employee-review fields:
understoodNeed
rightNextMove
calmingEffect
riskOrConfusion
seniorReview
reusableLearning
These replace freeform taxonomy collection for non-expert reviewers. The app can also suggest Employee Review answers from prompt/response heuristics. The suggestion is shown as a suggested signal only; the reviewer still saves the human review.

Suggested review layer

Added app-suggested review values for:
1-10 ratings
keepTalking
pass / refine / fail
priority
Employee Review answers
1-5 Human Guidance Evaluation scores
These suggestions come from prompt text, Lucia response text, run status, run errors, and simple response-quality heuristics such as clear next move, calming language, list-heavy output, robotic language, fake empathy, and overclaiming. They are not canonical truth.
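
A minimal sketch of what such a heuristic might look like follows. It only looks at response text, run status, and run errors. The regular expressions, the starting score, the score adjustments, and the pass / refine / fail thresholds are all assumptions chosen for illustration; the app's real heuristics are not documented here.

```typescript
// Hypothetical input and output shapes for an app-suggested review.
interface SuggestionInput {
  responseText: string;
  runStatus: string;
  runError?: string;
}

type Verdict = "pass" | "refine" | "fail";

interface SuggestedReview {
  rating: number; // 1-10
  verdict: Verdict;
  signals: string[];
  source: "app-suggested"; // marks the value as a suggestion, not a human review
}

function suggestReview(input: SuggestionInput): SuggestedReview {
  // A run that errored or did not complete gets the lowest suggestion.
  if (input.runStatus !== "completed" || input.runError) {
    return { rating: 1, verdict: "fail", signals: ["run-error"], source: "app-suggested" };
  }

  const text = input.responseText;
  const signals: string[] = [];
  let rating = 6; // assumed neutral starting point

  // List-heavy output: more than half of the lines look like list items.
  const lines = text.split("\n");
  const listLines = lines.filter((l) => /^\s*([-*•]|\d+\.)\s/.test(l)).length;
  if (listLines / lines.length > 0.5) {
    signals.push("list-heavy");
    rating -= 1;
  }
  if (/\b(next step|you can start by)\b/i.test(text)) {
    signals.push("clear-next-move");
    rating += 2;
  }
  if (/\b(guaranteed|always works)\b/i.test(text)) {
    signals.push("overclaiming");
    rating -= 2;
  }
  if (/\bas an ai\b/i.test(text)) {
    signals.push("robotic-language");
    rating -= 1;
  }

  rating = Math.max(1, Math.min(10, rating));
  const verdict: Verdict = rating >= 7 ? "pass" : rating >= 4 ? "refine" : "fail";
  return { rating, verdict, signals, source: "app-suggested" };
}
```

The source tag keeps the suggestion distinguishable from the saved human review, in line with the note that suggestions are not canonical truth.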

Review Queue UX

The Review Queue now favors guided employee judgment:
  • single-column Quick Review flow
  • numbered question cards
  • separate selection boxes
  • suggested selections
  • reduced freeform text burden
  • senior-review routing
  • canon-candidate routing
  • Human Guidance Evaluation
  • Save / Save & Next / Save & next flagged flows
  • search and workflow filters
  • JSON, CSV, and Markdown export controls
  • finalization after all prompts are reviewed

Semantic confidence sliders

The “How did Lucia do?” scoring section moved from 1–10 button rows to stepped semantic confidence sliders. The final design direction:
low score → muted concern
mid score → soft uncertainty
high score → restrained confidence
The sliders should feel like native OS controls: calm, premium, tactile, and low-friction.
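
The score-to-tone mapping above could be expressed as a small function. The three tone names come from the design direction; the cut-off points (1-3, 4-7, 8-10) and the toneForScore name are assumptions, since the notes do not state where each band begins.

```typescript
type SliderTone = "muted concern" | "soft uncertainty" | "restrained confidence";

// Maps a 1-10 score to a semantic tone. The band boundaries are assumed.
function toneForScore(score: number): SliderTone {
  // Snap to a whole step and clamp into the 1-10 range.
  const stepped = Math.min(10, Math.max(1, Math.round(score)));
  if (stepped <= 3) return "muted concern";
  if (stepped <= 7) return "soft uncertainty";
  return "restrained confidence";
}
```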

Adjudication queue filters

Added workflow queue filters for:
Needs final call
Canon candidates
This lets senior review focus on the cases that matter most. This release supports adjudication routing, metadata, and exports. It does not depend on a separate senior-adjudication editing screen.
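
The two filters can be pictured as simple predicates over queue items. In the sketch below, the QueueItem shape and the rule that "Needs final call" means senior review was requested and adjudication is still unresolved are assumptions for illustration; the release notes do not define the exact criteria.

```typescript
type QueueFilter = "all" | "needsFinalCall" | "canonCandidates";

// Hypothetical queue item; field names are illustrative only.
interface QueueItem {
  id: string;
  seniorReviewRequested: boolean;
  adjudicationResolved: boolean;
  canonCandidate: boolean;
}

function applyQueueFilter(items: QueueItem[], filter: QueueFilter): QueueItem[] {
  switch (filter) {
    case "needsFinalCall":
      // Assumed rule: routed to senior review and not yet adjudicated.
      return items.filter((i) => i.seniorReviewRequested && !i.adjudicationResolved);
    case "canonCandidates":
      return items.filter((i) => i.canonCandidate);
    default:
      return items;
  }
}
```

Because the filters only read routing metadata, they work without a dedicated adjudication editing screen.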

Exports

JSON, CSV, and Markdown exports now preserve structured review, suggested review, Employee Review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and prompt dirty/completion state.
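
For the CSV case, preserving structured review data mostly comes down to a fixed column list and correct escaping. The sketch below is an illustration under assumptions: the ExportRow columns are a hypothetical subset of the exported data, and toCsv is not the app's actual exporter.

```typescript
// Hypothetical flattened export row; the real export has more columns.
interface ExportRow {
  promptId: string;
  testerId: string;
  reviewLifecycle: string;
  verdict: string;
  suggestedVerdict: string;
  notes: string;
}

const EXPORT_COLUMNS: (keyof ExportRow)[] = [
  "promptId",
  "testerId",
  "reviewLifecycle",
  "verdict",
  "suggestedVerdict",
  "notes",
];

// Quote a value if it contains a comma, quote, or line break,
// doubling any embedded quotes (the usual CSV convention).
function escapeCsv(value: string): string {
  return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(rows: ExportRow[]): string {
  const header = EXPORT_COLUMNS.join(",");
  const body = rows.map((row) => EXPORT_COLUMNS.map((c) => escapeCsv(row[c])).join(","));
  return [header, ...body].join("\r\n");
}
```

Keeping the human verdict and the suggested verdict in separate columns means an export never blurs which value came from the reviewer.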

Supabase persistence

Supabase persistence now stores run lifecycle metadata on eval_runs, embeds the full case and prompt review record in eval_run_items.payload, and writes eval_item_reviews rows for review persistence. Hydration prefers the embedded eval_run_items.payload.promptRecord over fragile review-table reads.
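
The hydration preference can be sketched as a pure function over already-fetched rows. The payload.promptRecord path comes from the text above; the row shapes, the itemId and review column names, and the fallback to a review-table row are assumptions for illustration, and no Supabase client calls are shown.

```typescript
type PromptRecord = Record<string, unknown>;

// Hypothetical row shapes for eval_run_items and eval_item_reviews.
interface RunItemRow {
  id: string;
  payload: { promptRecord?: PromptRecord } | null;
}

interface ReviewRow {
  itemId: string;
  review: PromptRecord;
}

interface Hydrated {
  record: PromptRecord | null;
  source: "payload" | "review-table" | "none";
}

function hydratePromptRecord(item: RunItemRow, reviewRows: ReviewRow[]): Hydrated {
  // Preferred source: the full record embedded in eval_run_items.payload.
  const embedded = item.payload?.promptRecord;
  if (embedded) {
    return { record: embedded, source: "payload" };
  }
  // Assumed fallback: a matching row read from eval_item_reviews.
  const row = reviewRows.find((r) => r.itemId === item.id);
  if (row) {
    return { record: row.review, source: "review-table" };
  }
  return { record: null, source: "none" };
}
```

Returning the source alongside the record makes it easy to log how often the fallback path is taken.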

Current doctrine impact

This release established a new Eval Labs principle:
The app may suggest.
The reviewer must decide.
Senior meaning stays separate from employee signal.
This should be protected in future product work.

Product surface refinement

After the review-layer release, Eval Labs was refined into a clearer product surface:
  • top app shell owns page identity
  • in-page blog-style mastheads were removed from the app
  • Custom Prompt Test, Auto-generated Prompt Test, and Controlled Batch Runner are separate surfaces
  • Controlled Batch Runner is controlled readiness tooling; current access is owner/admin/evaluator, not tester
  • Auto-generated Prompt Test remains the normal 50-prompt generated tester
  • Run History rows use a standardized two-zone layout
  • copy controls use Copy Session ID / Copy Deep Link patterns across key surfaces
  • Single Run Analysis gives read-only run-level evidence outside the Review Queue

Readiness doctrine added

The AI-reviewed platform readiness gate passed after 60 completed runs and 3,000 reviewed prompts. This extends the review-layer doctrine:
The app may suggest.
The reviewer must decide.
AI-reviewed platform readiness is not human Lucia-quality approval.
Protect this distinction in future release notes and onboarding language.