This glossary defines the terms employees will see while using Eval Labs and reading the Canon.
Eval
A structured test of Lucia’s behavior. An eval is not only the prompt. It includes the response, review, scoring, and follow-up interpretation.
Run
One execution of a suite, generated prompt set, controlled batch, or other Eval Labs test path. A run contains run items. Run truth means the run lifecycle and persisted record agree.
Run item
One prompt/response pair inside a run. A run item is the unit reviewed in Review Queue and the unit labeled in Behavioral Observatory.
Prompt
The user message being tested. Example:
Lucia response
The response generated by Lucia from the Engine under test.
Review Queue
The place where a human reviewer evaluates each generated response. The Review Queue is shared by both custom runs and automated runs.
Review
The human or AI-generated evaluation record attached to a run item in the Review Queue flow. Review evidence can include ratings, suggested values, Quick Review, Human Guidance Evaluation, notes, save state, and finalization context. Review evidence is not the same as a persisted Behavioral Observatory label.
Registry Diagnostics
The read-only diagnostic surface at:
Dataset Registry
The canonical diagnostic taxonomy used to group Eval Labs evidence into meaningful dataset categories. In the current Registry Diagnostics surface, dataset use is diagnostic and derived.
Dataset
A named group of Eval Labs examples or signals that belong to the same behavioral/product area. A dataset can help organize evaluation evidence, but a derived match is not final human truth.
Dataset membership suggestion
A derived suggestion that a run item appears to belong to a dataset. It means:
Review queue lane
A workflow lane suggested for a run item. In Registry Diagnostics, lane suggestions are diagnostic and derived. They are not saved queue decisions.
Review Queue 2.0
The emerging review-routing model that suggests lanes for existing Eval Labs evidence. Current Registry Diagnostics output is for inspection, not final employee workflow UX.
Human Review Queue 2.0
The human-review workflow model behind Review Queue 2.0 lane suggestions. In the current Registry Diagnostics surface, Human Review Queue 2.0 lanes are derived suggestions only. They are not saved queue assignments and they are not Behavioral Observatory labels.
Derived signal
A signal inferred from existing Eval Labs data. Derived signals can help prefill, suggest, or inspect behavior. Derived signals are not saved human judgment.
Persisted label
A label saved to durable storage and reloadable after refresh. For Behavioral Observatory, a persisted label means Supabase confirmed a row in public.eval_behavioral_labels.
Behavioral Observatory
The first-class Eval Labs product surface at:
Behavioral label
A saved Behavioral Observatory judgment for a run item. Current fields:
public.eval_behavioral_labels when persistence succeeds.
Intent
What the human was trying to do. Behavioral Observatory currently supports:
Guest Affect
The human’s emotional state in the conversation. Behavioral Observatory currently supports:
Response Strategy
Lucia’s dominant response move. Behavioral Observatory currently supports:
Humanness
A 1–7 Behavioral Observatory label for how human Lucia’s response felt. Current anchors:
Gold Standard
A high-confidence human-reviewed example that can be used for calibration, training, or future benchmark design. Gold Standard examples require deliberate human judgment. A derived suggestion is not automatically Gold Standard.
Custom Prompt Suite
A saved set of 1–10 manually chosen prompts. Use custom suites when testing a specific behavior family repeatedly. Examples:
- Overwhelm phrasing
- Lost / out-of-loop prompts
- Payment-risk triage
- Concierge confirmation gaps
- Guest trust repair
- Spanish language handling
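The 1–10 prompt limit for a Custom Prompt Suite can be checked before a suite is saved. The sketch below is illustrative only: the `CustomPromptSuite` shape, the `validateSuite` helper, and the rule that blank prompts are rejected are assumptions, not the actual Eval Labs implementation.

```typescript
// Hypothetical shape; the real Eval Labs suite record is not shown in this glossary.
interface CustomPromptSuite {
  name: string;
  prompts: string[];
}

// Bounds taken from the glossary definition: a suite holds 1–10 prompts.
const MIN_PROMPTS = 1;
const MAX_PROMPTS = 10;

// Returns a list of problems; an empty list means the suite fits the 1–10 rule.
function validateSuite(suite: CustomPromptSuite): string[] {
  const problems: string[] = [];
  const prompts = suite.prompts.map((p) => p.trim());

  if (prompts.length < MIN_PROMPTS || prompts.length > MAX_PROMPTS) {
    problems.push(
      `expected ${MIN_PROMPTS}-${MAX_PROMPTS} prompts, got ${prompts.length}`
    );
  }
  // Assumption: a blank prompt is treated as invalid.
  if (prompts.some((p) => p.length === 0)) {
    problems.push("prompts must not be blank");
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets a caller show every issue at once, for example when a suite is both too long and contains a blank prompt.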
Auto-generated 50-Prompt Test
A broader 50-prompt test run generated by Eval Labs for full-spectrum review. Use it for regression coverage after broader changes. Current canonical route:
Controlled Batch Runner
The controlled platform-readiness surface used for controlled 1-run smoke, 3-run checkpoint, and 10-run checkpoint batches. It was used to complete the 60-run AI-reviewed platform readiness gate. Owner/admin and evaluator roles can use it in the current access model. Tester cannot. Canonical route:
AI-reviewed platform readiness gate
A controlled batch validation protocol that proves Eval Labs platform behavior can complete end to end. It can prove run creation, Lucia response capture, review generation, review persistence, finalization, Run History truth, Global Analysis truth, Supabase count alignment, localStorage compactness, scoped visibility in the tested owner context, and controlled batch lifecycle. It does not prove Lucia is human-approved.
Human Lucia-quality approval
The judgment layer where human evaluators decide whether Lucia’s behavior is ready, useful, trustworthy, and operationally appropriate. This remains separate from AI-reviewed platform readiness.
Run Source
The source type of the run. Current values:
custom means the run came from a user-created prompt suite.
automated means it came from the 50-prompt generated battery.
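The two Run Source values map naturally onto a string union. This is a minimal sketch: the values custom and automated come from the definition above, while the `RunSource` type and both helper names are hypothetical.

```typescript
// The two values named in the glossary; the type name itself is an assumption.
type RunSource = "custom" | "automated";

// Human-readable origin for each run source, following the glossary wording.
function describeRunSource(source: RunSource): string {
  switch (source) {
    case "custom":
      return "user-created prompt suite";
    case "automated":
      return "50-prompt generated battery";
  }
}

// Narrows an untrusted string (for example, a stored column value) to a RunSource.
// Returns null for anything that is not one of the two known values.
function parseRunSource(value: string): RunSource | null {
  return value === "custom" || value === "automated" ? value : null;
}
```

Parsing to `null` instead of defaulting to one of the values keeps an unknown source visible to the caller rather than silently relabeling it.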
Tester identity
The logged-in Clerk user who saves or exports review data. Eval Labs records limited identity metadata:
Role metadata
The current Clerk public metadata key used by Eval Labs is:
Owner role
The privileged Eval Labs role with full access to Home, Launcher, Custom Prompt Test, Auto-generated Prompt Test, Guest Facing Agent Verification Check, Verification Results, Controlled Batch Runner, Run History, Team Review, Global Analysis, Single Run Analysis, review routes, and future admin/tooling surfaces.
Admin role
The privileged operational role. Admin has access similar to owner for current testing, evidence inspection, Team Review, Global Analysis, batch runner usage, and evaluator oversight.
Evaluator role
The full evaluator workbench role. Evaluators can use evaluator-safe test surfaces and their own run/review/history routes. Evaluators cannot see Team Review, Global Analysis, owner/admin tools, or shared platform-wide evidence unless explicitly widened later.
Tester role
The entry-level prompt-testing role. Testers can use Custom Prompt Test and Auto-generated Prompt Test. Testers cannot use Verification Check, Verification Results, Controlled Batch Runner, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.
Run History
The scoped run ledger at:
Team Review
The owner/admin oversight surface at:
Global Analysis
The read-only behavioral and analytics surface at:
Single Run Analysis
The read-only analysis surface for one completed run/session:
localStorage compactness
The client persistence doctrine that completed cloud-backed runs should not persist full item-level payloads in localStorage. The readiness diagnostic target is:
RLS / backend permission enforcement
Supabase row-level security and backend/API permission checks. Frontend role behavior comes from Clerk public metadata. Persisted evidence protection depends on the Clerk session token carrying eval_labs_role so Supabase RLS can recognize privileged owner/admin access.
Verify the Clerk-to-Supabase role claim path when role metadata, JWT templates, RLS policies, or privileged evidence hydration changes.
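One way to spot-check the role claim path is to inspect the decoded session token claims and confirm that eval_labs_role is present and privileged. The sketch below works on an already-decoded claims object and does not call Clerk or Supabase. The claim name eval_labs_role comes from this glossary; the literal role strings "owner" and "admin" and both helper names are assumptions.

```typescript
// Assumed literal claim values for the privileged roles; confirm against the real JWT template.
const PRIVILEGED_ROLES: readonly string[] = ["owner", "admin"];

// Reads the eval_labs_role claim from decoded token claims.
// Returns null when the claim is missing, empty, or not a string.
function readRoleClaim(claims: Record<string, unknown>): string | null {
  const role = claims["eval_labs_role"];
  return typeof role === "string" && role.length > 0 ? role : null;
}

// True only when the token carries a privileged role claim.
// A missing claim is treated as not privileged, matching the idea that
// RLS cannot recognize owner/admin access without it.
function hasPrivilegedClaim(claims: Record<string, unknown>): boolean {
  const role = readRoleClaim(claims);
  return role !== null && PRIVILEGED_ROLES.some((r) => r === role);
}
```

A check like this only confirms what the token carries on the client side. The RLS policies themselves still need to be verified in Supabase.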
exportedBy
The user who exported a session file.
Important: this may differ from the person who originally reviewed the prompts.
savedBy
The user who saved a specific prompt review.
This is more important than exportedBy when auditing human review work.
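The difference between exportedBy and savedBy matters most when one person exports a session that others reviewed. A minimal sketch of that audit, assuming a hypothetical export shape in which each review carries its own savedBy value (only the field names exportedBy and savedBy come from this glossary):

```typescript
// Hypothetical shapes for illustration; the real export format is not shown here.
interface SavedReview {
  promptId: string; // hypothetical identifier for the reviewed prompt
  savedBy: string;  // user who saved this specific prompt review
}

interface SessionExport {
  exportedBy: string; // user who exported the session file
  reviews: SavedReview[];
}

// Reviews whose saver is not the exporter, i.e. work that should be
// credited to someone other than the person who produced the file.
function reviewsSavedByOthers(session: SessionExport): SavedReview[] {
  return session.reviews.filter((r) => r.savedBy !== session.exportedBy);
}

// Counts reviews per saver, which is the figure that matters when auditing human review work.
function reviewCountBySaver(session: SessionExport): Map<string, number> {
  const counts = new Map<string, number>();
  for (const review of session.reviews) {
    counts.set(review.savedBy, (counts.get(review.savedBy) ?? 0) + 1);
  }
  return counts;
}
```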

