This glossary defines the terms employees will see while using Eval Labs and reading the Canon.

Eval

A structured test of Lucia’s behavior. An eval is not only the prompt. It includes the response, review, scoring, and follow-up interpretation.

Run

One execution of a suite, generated prompt set, controlled batch, or other Eval Labs test path. A run contains run items. Run truth means the run lifecycle and persisted record agree.

Run item

One prompt/response pair inside a run. A run item is the unit reviewed in Review Queue and the unit labeled in Behavioral Observatory.

Prompt

The user message being tested. Example:
I feel totally out of the loop.

Lucia response

The response generated by Lucia from the Engine under test.

Review Queue

The place where a human reviewer evaluates each generated response. The Review Queue is shared by both custom runs and automated runs.

Review

The human or AI-generated evaluation record attached to a run item in the Review Queue flow. Review evidence can include ratings, suggested values, Quick Review, Human Guidance Evaluation, notes, save state, and finalization context. Review evidence is not the same as a persisted Behavioral Observatory label.

Registry Diagnostics

The read-only diagnostic surface at:
/registry-diagnostics
Registry Diagnostics inspects existing Eval Labs run/review data and shows derived Dataset Registry membership suggestions and Human Review Queue 2.0 lane suggestions. It does not create labels or save Behavioral Observatory decisions.

Dataset Registry

The canonical diagnostic taxonomy used to group Eval Labs evidence into meaningful dataset categories. In the current Registry Diagnostics surface, dataset use is diagnostic and derived.

Dataset

A named group of Eval Labs examples or signals that belong to the same behavioral/product area. A dataset can help organize evaluation evidence, but a derived match is not final human truth.

Dataset membership suggestion

A derived suggestion that a run item appears to belong to a dataset. It means:
The model found evidence that this item may belong here.
It does not mean:
A human approved this dataset membership.

Review queue lane

A workflow lane suggested for a run item. In Registry Diagnostics, lane suggestions are diagnostic and derived. They are not saved queue decisions.

Review Queue 2.0

The emerging review-routing model that suggests lanes for existing Eval Labs evidence. Current Registry Diagnostics output is for inspection, not final employee workflow UX.

Human Review Queue 2.0

The human-review workflow model behind Review Queue 2.0 lane suggestions. In the current Registry Diagnostics surface, Human Review Queue 2.0 lanes are derived suggestions only. They are not saved queue assignments and they are not Behavioral Observatory labels.

Derived signal

A signal inferred from existing Eval Labs data. Derived signals can help prefill, suggest, or inspect behavior. Derived signals are not saved human judgment.

Persisted label

A label saved to durable storage and reloadable after refresh. For Behavioral Observatory, a persisted label means Supabase confirmed a row in public.eval_behavioral_labels.

Behavioral Observatory

The first-class Eval Labs product surface at:
/behavioral-observatory
Behavioral Observatory lets a reviewer inspect a conversation and save structured behavioral labels.

Behavioral label

A saved Behavioral Observatory judgment for a run item. Current fields:
intent
guest_affect
response_strategy
humanness
notes
Behavioral labels are stored in public.eval_behavioral_labels when persistence succeeds.

Intent

What the human was trying to do. Behavioral Observatory currently supports:
Booking Help
Check-In
Checkout
Billing
Noise
Room Issue
Concierge
Other

Guest Affect

The human’s emotional state in the conversation. Behavioral Observatory currently supports:
Neutral
Mildly Upset
Upset
Grateful
Use the smallest truthful affect. Do not dramatize.

Response Strategy

Lucia’s dominant response move. Behavioral Observatory currently supports:
Acknowledge
Apology
Offer
Escalation
Choose the main strategy, not every strategy present.

Humanness

A 1–7 Behavioral Observatory label for how human Lucia’s response felt. Current anchors:
1 = Template
4 = Functional
7 = Warm + Specific
Humanness is not a substitute for truth, usefulness, or safety.
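As a rough illustration of how the label fields and their allowed values fit together, the sketch below validates a candidate label before it is saved. The exact stored spellings of the values, the integer-only rule for humanness, and the function name validateBehavioralLabel are assumptions for illustration, not the confirmed schema of public.eval_behavioral_labels.

```typescript
// Sketch only: value spellings and the integer rule for humanness are assumptions.
const INTENTS = ["Booking Help", "Check-In", "Checkout", "Billing", "Noise", "Room Issue", "Concierge", "Other"] as const;
const GUEST_AFFECTS = ["Neutral", "Mildly Upset", "Upset", "Grateful"] as const;
const RESPONSE_STRATEGIES = ["Acknowledge", "Apology", "Offer", "Escalation"] as const;

type BehavioralLabel = {
  intent: (typeof INTENTS)[number];
  guest_affect: (typeof GUEST_AFFECTS)[number];
  response_strategy: (typeof RESPONSE_STRATEGIES)[number];
  humanness: number; // 1 = Template, 4 = Functional, 7 = Warm + Specific
  notes: string;
};

// True when value is a string found in the allowed list.
function oneOf(allowed: readonly string[], value: unknown): boolean {
  return typeof value === "string" && allowed.indexOf(value) !== -1;
}

// Returns the names of the fields that fail validation (empty array = valid).
function validateBehavioralLabel(
  input: Partial<Record<keyof BehavioralLabel, unknown>>
): string[] {
  const errors: string[] = [];
  if (!oneOf(INTENTS, input.intent)) errors.push("intent");
  if (!oneOf(GUEST_AFFECTS, input.guest_affect)) errors.push("guest_affect");
  if (!oneOf(RESPONSE_STRATEGIES, input.response_strategy)) errors.push("response_strategy");
  const h = input.humanness;
  if (typeof h !== "number" || !Number.isInteger(h) || h < 1 || h > 7) errors.push("humanness");
  // Assumption: notes are optional, but must be a string when present.
  if (input.notes !== undefined && typeof input.notes !== "string") errors.push("notes");
  return errors;
}
```

A label with intent "Billing", guest_affect "Mildly Upset", response_strategy "Apology", and humanness 4 would pass this check. A humanness of 8 or an unlisted intent would be reported by field name.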

Gold Standard

A high-confidence human-reviewed example that can be used for calibration, training, or future benchmark design. Gold Standard examples require deliberate human judgment. A derived suggestion is not automatically Gold Standard.

Custom Prompt Suite

A saved set of 1–10 manually chosen prompts. Use custom suites when testing a specific behavior family repeatedly. Examples:
  • Overwhelm phrasing
  • Lost / out-of-loop prompts
  • Payment-risk triage
  • Concierge confirmation gaps
  • Guest trust repair
  • Spanish language handling
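The 1–10 prompt limit could be checked with something like the sketch below. The suite shape (a name plus a list of prompt strings) and the choice to ignore blank prompts are assumptions made for illustration.

```typescript
// Hypothetical suite shape; the real Eval Labs record may differ.
type CustomPromptSuite = { name: string; prompts: string[] };

const MIN_PROMPTS = 1;
const MAX_PROMPTS = 10;

// True when the suite holds between 1 and 10 non-blank prompts.
// Assumption: whitespace-only prompts do not count toward the total.
function isValidSuiteSize(suite: CustomPromptSuite): boolean {
  const count = suite.prompts.filter((p) => p.trim().length > 0).length;
  return count >= MIN_PROMPTS && count <= MAX_PROMPTS;
}
```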

Auto-generated 50-Prompt Test

A broader 50-prompt test run generated by Eval Labs for full-spectrum review. Use it for regression coverage after broader changes. Current canonical route:
/lucia/auto-generated
Legacy inbound alias:
/lucia/automated

Controlled Batch Runner

The controlled platform-readiness surface used for 1-run smoke, 3-run checkpoint, and 10-run checkpoint batches. It was used to complete the 60-run AI-reviewed platform readiness gate. In the current access model, the owner, admin, and evaluator roles can use it. The tester role cannot. Canonical route:
/lucia/batch-runner

AI-reviewed platform readiness gate

A controlled batch validation protocol that proves Eval Labs platform behavior can complete end to end. It can prove:
run creation
Lucia response capture
review generation
review persistence
finalization
Run History truth
Global Analysis truth
Supabase count alignment
localStorage compactness
scoped visibility in the tested owner context
controlled batch lifecycle
It does not prove Lucia is human-approved.

Human Lucia-quality approval

The judgment layer where human evaluators decide whether Lucia’s behavior is ready, useful, trustworthy, and operationally appropriate. This remains separate from AI-reviewed platform readiness.

Run Source

The source type of the run. Current values:
custom
automated
manual
custom means the run came from a user-created prompt suite. automated means it came from the 50-prompt generated battery.

Tester identity

The logged-in Clerk user who saves or exports review data. Eval Labs records limited identity metadata:
Clerk user id
email
display name when available
This helps us know who evaluated a response.

Role metadata

The current Clerk public metadata key used by Eval Labs is:
{
  "eval_labs_role": "owner"
}
Supported values are:
owner
admin
evaluator
tester
Missing or unknown role metadata should not grant privileged access.
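One way to honor the fail-closed rule is to treat anything other than an exact supported value as "no role". The sketch below assumes the metadata arrives as a plain object. The names parseRole and isPrivileged are hypothetical, and exact, case-sensitive matching is an assumption.

```typescript
type EvalLabsRole = "owner" | "admin" | "evaluator" | "tester";
const ROLES: readonly EvalLabsRole[] = ["owner", "admin", "evaluator", "tester"];

// Returns the role only when eval_labs_role exactly matches a supported value.
// Missing metadata, non-object metadata, and unknown values all yield null.
function parseRole(publicMetadata: unknown): EvalLabsRole | null {
  if (typeof publicMetadata !== "object" || publicMetadata === null) return null;
  const raw = (publicMetadata as Record<string, unknown>)["eval_labs_role"];
  const match = ROLES.find((r) => r === raw);
  return match ?? null;
}

// Only owner and admin count as privileged; null never does.
function isPrivileged(role: EvalLabsRole | null): boolean {
  return role === "owner" || role === "admin";
}
```

With this shape, a typo such as "Owner" or "superuser" resolves to null and is handled the same way as missing metadata.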

Owner role

The privileged Eval Labs role with full access to Home, Launcher, Custom Prompt Test, Auto-generated Prompt Test, Guest Facing Agent Verification Check, Verification Results, Controlled Batch Runner, Run History, Team Review, Global Analysis, Single Run Analysis, review routes, and future admin/tooling surfaces.

Admin role

The privileged operational role. Admin has similar access to owner for current testing, evidence inspection, Team Review, Global Analysis, batch runner usage, and evaluator oversight.

Evaluator role

The full evaluator workbench role. Evaluators can use evaluator-safe test surfaces and their own run/review/history routes. Evaluators cannot see Team Review, Global Analysis, owner/admin tools, or shared platform-wide evidence unless their access is explicitly widened later.

Tester role

The entry-level prompt-testing role. Testers can use Custom Prompt Test and Auto-generated Prompt Test. Testers cannot use Verification Check, Verification Results, Controlled Batch Runner, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.
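The role descriptions can be read as an access table. The sketch below covers only surfaces whose access is stated in this glossary. The surface keys are hypothetical identifiers, and evaluator access to the two prompt tests is an assumption based on "evaluator-safe test surfaces".

```typescript
type Role = "owner" | "admin" | "evaluator" | "tester";
type Surface =
  | "custom-prompt-test"
  | "auto-generated-test"
  | "batch-runner"
  | "team-review"
  | "global-analysis";

// Which roles may open each surface, per the glossary entries.
const ALLOWED: Record<Surface, readonly Role[]> = {
  "custom-prompt-test": ["owner", "admin", "evaluator", "tester"],
  "auto-generated-test": ["owner", "admin", "evaluator", "tester"],
  "batch-runner": ["owner", "admin", "evaluator"], // testers cannot
  "team-review": ["owner", "admin"],               // owner/admin oversight
  "global-analysis": ["owner", "admin"],           // owner/admin-only
};

// A null role (missing or unknown metadata) is denied everywhere.
function canAccess(role: Role | null, surface: Surface): boolean {
  if (role === null) return false;
  return ALLOWED[surface].indexOf(role) !== -1;
}
```

A frontend check like this only shapes what the UI shows. Protection of persisted evidence still depends on backend enforcement.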

Run History

The scoped run ledger at:
/lucia/automated/runs
It records completed/finalized run truth and may include scoped operational run state.

Team Review

The owner/admin oversight surface at:
/team-review
Team Review groups evaluator activity, review gaps, flags, recent work, and evidence that needs privileged attention.

Global Analysis

The read-only behavioral and analytics surface at:
/analysis
Global Analysis is owner/admin-only in the current model. It shows AI-analyzed platform evidence, not human Lucia-quality approval. The legacy alias is:
/experiments
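The legacy aliases in this glossary (/lucia/automated and /experiments) could be normalized with a small lookup like the sketch below. The function name canonicalPath is hypothetical. Matching is deliberately exact rather than prefix-based, because Run History is listed at /lucia/automated/runs and should not be rewritten.

```typescript
// Legacy inbound alias -> current canonical route, as listed in this glossary.
const LEGACY_ALIASES = new Map<string, string>([
  ["/lucia/automated", "/lucia/auto-generated"],
  ["/experiments", "/analysis"],
]);

// Strips one trailing slash, then maps exact legacy paths to canonical ones.
// Paths that are not aliases are returned unchanged (minus the trailing slash).
function canonicalPath(path: string): string {
  const trimmed = path.length > 1 && path.endsWith("/") ? path.slice(0, -1) : path;
  return LEGACY_ALIASES.get(trimmed) ?? trimmed;
}
```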

Single Run Analysis

The read-only analysis surface for one completed run/session:
/analysis/runs/:sessionId
It can include run metadata, behavioral summaries, item rows, copy controls, and deep links.

localStorage compactness

The client persistence doctrine that completed cloud-backed runs should not persist full item-level payloads in localStorage. The readiness diagnostic target is:
persistedLocalFullPayloadSessionCount = 0
persistedLocalHasItemLevelData = false
persistedLocalItemLevelDataSessionCount = 0
ownedSessionCount = expected run count
otherOwnerSessionCount = 0
ownerlessSessionCount = 0
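The target above can be expressed as a check over a diagnostics snapshot. In the sketch below, the field names come from the target list. The object shape, the function name compactnessFailures, and the way the snapshot is produced are assumptions.

```typescript
type LocalStorageDiagnostics = {
  persistedLocalFullPayloadSessionCount: number;
  persistedLocalHasItemLevelData: boolean;
  persistedLocalItemLevelDataSessionCount: number;
  ownedSessionCount: number;
  otherOwnerSessionCount: number;
  ownerlessSessionCount: number;
};

// Returns the names of the fields that miss the readiness target (empty = pass).
function compactnessFailures(d: LocalStorageDiagnostics, expectedRunCount: number): string[] {
  const failures: string[] = [];
  if (d.persistedLocalFullPayloadSessionCount !== 0) failures.push("persistedLocalFullPayloadSessionCount");
  if (d.persistedLocalHasItemLevelData !== false) failures.push("persistedLocalHasItemLevelData");
  if (d.persistedLocalItemLevelDataSessionCount !== 0) failures.push("persistedLocalItemLevelDataSessionCount");
  if (d.ownedSessionCount !== expectedRunCount) failures.push("ownedSessionCount");
  if (d.otherOwnerSessionCount !== 0) failures.push("otherOwnerSessionCount");
  if (d.ownerlessSessionCount !== 0) failures.push("ownerlessSessionCount");
  return failures;
}
```

Returning the failing field names, rather than a single boolean, makes it easier to see which part of the target was missed.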

RLS / backend permission enforcement

Supabase row-level security and backend/API permission checks. Frontend role behavior comes from Clerk public metadata. Persisted evidence protection depends on the Clerk session token carrying eval_labs_role so Supabase RLS can recognize privileged owner/admin access. Verify the Clerk-to-Supabase role claim path when role metadata, JWT templates, RLS policies, or privileged evidence hydration changes.

exportedBy

The user who exported a session file. Important: this may differ from the person who originally reviewed the prompts.

savedBy

The user who saved a specific prompt review. This is more important than exportedBy when auditing human review work.

savedAt

The time a review was saved.

Intent layer

The part of Lucia responsible for interpreting what kind of user message was sent and routing it into the correct behavior mode. If a distress prompt routes to a generic capability redirect, that is usually an intent-layer failure.

Emotional containment

Lucia’s ability to reduce felt chaos without becoming therapy-bot language. Containment means Lucia narrows the field and gives one clear next move.

Trust-state discipline

Lucia’s habit of distinguishing what is known, inferred, suggested, requested, confirmed, or not yet done. A trust-state failure is serious.

Truth-state

The specific truth status of a claim or action. Examples:
known
inferred
suggested
requested
confirmed
not yet done
Truth-state is the thing being preserved. Trust-state discipline is the habit of preserving it.

Regression

A behavior that used to work but breaks after a code, prompt, model, or configuration change. Eval Labs exists largely to detect regressions before they become product damage.

Employee Review

The fast, guided review layer used by non-expert reviewers. Employee Review captures observable human judgment without asking employees to invent labels or taxonomies.

Quick Review

The guided question flow inside the Review Queue. It asks simple questions such as whether Lucia understood the need, gave the right next move, calmed the situation, or created risk.

Adjudication

The senior-review process that assigns final meaning to ambiguous, high-risk, or reusable cases. Adjudication converts review signal into canonical training signal.

Senior reviewer

A reviewer trusted to inspect escalated cases, resolve ambiguity, assign final labels, and decide whether a case should become a canon candidate.

reviewState

The workflow state for a prompt review. Current values include:
clean_pass
needs_review
needs_adjudication
canon_candidate

Needs final call

Employee-facing language for a case that needs senior adjudication.
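As a small sketch of how reviewState values relate to the employee-facing wording, the code below checks a raw value against the listed states and maps needs_adjudication to "Needs final call". The glossary says the current values "include" these four, so the list may be incomplete. The helper names are hypothetical, and no display labels are invented for the other states.

```typescript
// The four values listed in this glossary; the real set may be larger.
const REVIEW_STATES = ["clean_pass", "needs_review", "needs_adjudication", "canon_candidate"] as const;
type ReviewState = (typeof REVIEW_STATES)[number];

// Type guard for untrusted input such as a parsed export file.
function isReviewState(value: unknown): value is ReviewState {
  return typeof value === "string" && (REVIEW_STATES as readonly string[]).indexOf(value) !== -1;
}

// Only needs_adjudication has employee-facing wording defined in this glossary;
// every other state falls back to its raw value.
function employeeLabel(state: ReviewState): string {
  return state === "needs_adjudication" ? "Needs final call" : state;
}
```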

Canon candidate

A response or failure pattern that may teach Lucia something durable enough to enter Canon, eval suites, or future training guidance.

Ontology drift

The quality failure that happens when reviewers invent inconsistent categories, labels, or meanings over time. Eval Labs prevents ontology drift by separating Employee Review from Adjudication.

Semantic confidence bar

The stepped 1–10 slider used for scoring dimensions. It uses restrained color and fill behavior to help reviewers feel score quality without extra interpretation.