Eval Labs Glossary - HelloLucia

This glossary defines the terms employees will see while using Eval Labs and reading the Canon.

Eval

A structured test of Lucia’s behavior. An eval is not only the prompt. It includes the response, review, scoring, and follow-up interpretation.

Run

One execution of a suite, generated prompt set, controlled batch, or other Eval Labs test path. A run contains run items. Run truth means the run lifecycle and persisted record agree.

Run item

One prompt/response pair inside a run. A run item is the unit reviewed in Review Queue and the unit labeled in Behavioral Observatory.

Prompt

The user message being tested. Example:

I feel totally out of the loop.

Lucia response

The response generated by Lucia from the Engine under test.

Review Queue

The place where a human reviewer evaluates each generated response. The Review Queue is shared by both custom runs and automated runs.

Review

The human or AI-generated evaluation record attached to a run item in the Review Queue flow. Review evidence can include ratings, suggested values, Quick Review, Human Guidance Evaluation, notes, save state, and finalization context. Review evidence is not the same as a persisted Behavioral Observatory label.

Registry Diagnostics

The read-only diagnostic surface at:

/registry-diagnostics

Registry Diagnostics inspects existing Eval Labs run/review data and shows derived Dataset Registry membership suggestions and Human Review Queue 2.0 lane suggestions. It does not create labels or save Behavioral Observatory decisions.

Dataset Registry

The canonical diagnostic taxonomy used to group Eval Labs evidence into meaningful dataset categories. In the current Registry Diagnostics surface, dataset use is diagnostic and derived.

Dataset

A named group of Eval Labs examples or signals that belong to the same behavioral/product area. A dataset can help organize evaluation evidence, but a derived match is not final human truth.

Dataset membership suggestion

A derived suggestion that a run item appears to belong to a dataset. It means:

The model found evidence that this item may belong here.

It does not mean:

A human approved this dataset membership.

Review queue lane

A workflow lane suggested for a run item. In Registry Diagnostics, lane suggestions are diagnostic and derived. They are not saved queue decisions.

Review Queue 2.0

The emerging review-routing model that suggests lanes for existing Eval Labs evidence. Current Registry Diagnostics output is for inspection, not final employee workflow UX.

Human Review Queue 2.0

The human-review workflow model behind Review Queue 2.0 lane suggestions. In the current Registry Diagnostics surface, Human Review Queue 2.0 lanes are derived suggestions only. They are not saved queue assignments and they are not Behavioral Observatory labels.

Derived signal

A signal inferred from existing Eval Labs data. Derived signals can help prefill, suggest, or inspect behavior. Derived signals are not saved human judgment.

Persisted label

A label saved to durable storage and reloadable after refresh. For Behavioral Observatory, a persisted label means Supabase confirmed a row in public.eval_behavioral_labels.

Behavioral Observatory

The first-class Eval Labs product surface at:

/behavioral-observatory

Behavioral Observatory lets a reviewer inspect a conversation and save structured behavioral labels.

Behavioral label

A saved Behavioral Observatory judgment for a run item. Current fields:

intent
guest_affect
response_strategy
humanness
notes

Behavioral labels are stored in public.eval_behavioral_labels when persistence succeeds.

Intent

What the human was trying to do. Behavioral Observatory currently supports:

Booking Help
Check-In
Checkout
Billing
Noise
Room Issue
Concierge
Other

Guest Affect

The human’s emotional state in the conversation. Behavioral Observatory currently supports:

Neutral
Mildly Upset
Upset
Grateful

Use the smallest truthful affect. Do not dramatize.

Response Strategy

Lucia’s dominant response move. Behavioral Observatory currently supports:

Acknowledge
Apology
Offer
Escalation

Choose the main strategy, not every strategy present.

Humanness

A 1-7 Behavioral Observatory label for how human Lucia’s response felt. Current anchors:

= Template
= Functional
= Warm + Specific

Humanness is not a substitute for truth, usefulness, or safety.
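
The label fields and their allowed values can be sketched as a small validator. This is an illustrative sketch only, not the Eval Labs implementation: the names BehavioralLabelDraft and validateBehavioralLabel are hypothetical, and treating notes as a free-text string is an assumption. The option strings and the 1-7 range come from the entries above.

```typescript
// Option lists copied from the glossary entries above.
const INTENTS = [
  "Booking Help", "Check-In", "Checkout", "Billing",
  "Noise", "Room Issue", "Concierge", "Other",
] as const;
const GUEST_AFFECTS = ["Neutral", "Mildly Upset", "Upset", "Grateful"] as const;
const RESPONSE_STRATEGIES = ["Acknowledge", "Apology", "Offer", "Escalation"] as const;

// Hypothetical shape of a label before it is saved.
interface BehavioralLabelDraft {
  intent: string;
  guest_affect: string;
  response_strategy: string;
  humanness: number;
  notes: string; // assumed to be free text
}

// Returns the names of fields whose values are not supported (empty when valid).
function validateBehavioralLabel(draft: BehavioralLabelDraft): string[] {
  const problems: string[] = [];
  if ((INTENTS as readonly string[]).indexOf(draft.intent) === -1) {
    problems.push("intent");
  }
  if ((GUEST_AFFECTS as readonly string[]).indexOf(draft.guest_affect) === -1) {
    problems.push("guest_affect");
  }
  if ((RESPONSE_STRATEGIES as readonly string[]).indexOf(draft.response_strategy) === -1) {
    problems.push("response_strategy");
  }
  // Humanness is a whole number on the 1-7 scale.
  const h = draft.humanness;
  if (Math.floor(h) !== h || h < 1 || h > 7) {
    problems.push("humanness");
  }
  return problems;
}
```

A draft that passes this check is only well-formed. It becomes a persisted label only once storage confirms the row.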

Gold Standard

A high-confidence human-reviewed example that can be used for calibration, training, or future benchmark design. Gold Standard examples require deliberate human judgment. A derived suggestion is not automatically Gold Standard.

Custom Prompt Suite

A saved set of 1–10 manually chosen prompts. Use custom suites when testing a specific behavior family repeatedly. Examples:

Overwhelm phrasing
Lost / out-of-loop prompts
Payment-risk triage
Concierge confirmation gaps
Guest trust repair
Spanish language handling
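
The 1–10 size rule can be expressed as a simple check. This is a sketch under assumptions: the CustomPromptSuite shape and the function name are hypothetical, and only the size limits come from the definition above.

```typescript
// Size limits taken from the glossary definition.
const MIN_PROMPTS = 1;
const MAX_PROMPTS = 10;

// Hypothetical shape of a saved suite.
interface CustomPromptSuite {
  name: string;
  prompts: string[];
}

// True when the suite holds between 1 and 10 prompts, inclusive.
function hasValidSuiteSize(suite: CustomPromptSuite): boolean {
  const count = suite.prompts.length;
  return count >= MIN_PROMPTS && count <= MAX_PROMPTS;
}
```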

Auto-generated 50-Prompt Test

A broader 50-prompt test run generated by Eval Labs for full-spectrum review. Use it for regression coverage after broader changes. Current canonical route:

/lucia/auto-generated

Legacy inbound alias:

/lucia/automated

Controlled Batch Runner

The controlled platform-readiness surface used for controlled 1-run smoke, 3-run checkpoint, and 10-run checkpoint batches. It was used to complete the 60-run AI-reviewed platform readiness gate. In the current access model, the owner, admin, and evaluator roles can use it; the tester role cannot. Canonical route:

/lucia/batch-runner

AI-reviewed platform readiness gate

A controlled batch validation protocol that proves Eval Labs platform behavior can complete end to end. It can prove run creation, Lucia response capture, review generation, review persistence, finalization, Run History truth, Global Analysis truth, Supabase count alignment, localStorage compactness, scoped visibility in the tested owner context, and controlled batch lifecycle. It does not prove Lucia is human-approved.

Human Lucia-quality approval

The judgment layer where human evaluators decide whether Lucia’s behavior is ready, useful, trustworthy, and operationally appropriate. This remains separate from AI-reviewed platform readiness.

Run Source

The source type of the run. Current values:

custom
automated
manual

custom means the run came from a user-created prompt suite. automated means it came from the 50-prompt generated battery.

Tester identity

The logged-in Clerk user who saves or exports review data. Eval Labs records limited identity metadata:

Clerk user id
email
display name when available

This helps us know who evaluated a response.

Role metadata

The current Clerk public metadata key used by Eval Labs is:

{
  "eval_labs_role": "owner"
}

Supported values are:

owner
admin
evaluator
tester

Missing or unknown role metadata should not grant privileged access.
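
A fail-closed reading of this rule might look like the sketch below. It takes the public metadata as a plain object and makes no Clerk SDK calls. The function names are hypothetical, and treating owner and admin as the privileged roles follows the role entries in this glossary.

```typescript
type EvalLabsRole = "owner" | "admin" | "evaluator" | "tester";

const SUPPORTED_ROLES: readonly EvalLabsRole[] = ["owner", "admin", "evaluator", "tester"];

// Returns a supported role, or null when the metadata is missing or unknown.
function parseEvalLabsRole(publicMetadata: unknown): EvalLabsRole | null {
  if (typeof publicMetadata !== "object" || publicMetadata === null) {
    return null;
  }
  const raw = (publicMetadata as Record<string, unknown>)["eval_labs_role"];
  for (const role of SUPPORTED_ROLES) {
    if (role === raw) {
      return role;
    }
  }
  return null;
}

// Privileged access requires an explicit owner or admin role; null never qualifies.
function isPrivileged(role: EvalLabsRole | null): boolean {
  return role === "owner" || role === "admin";
}
```

Returning null instead of a default role keeps a typo such as "Owner" or a missing key from quietly granting access.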

Owner role

The privileged Eval Labs role with full access to Home, Launcher, Custom Prompt Test, Auto-generated Prompt Test, Guest Facing Agent Verification Check, Verification Results, Controlled Batch Runner, Run History, Team Review, Global Analysis, Single Run Analysis, review routes, and future admin/tooling surfaces.

Admin role

The privileged operational role. Admin access is similar to owner access for current testing, evidence inspection, Team Review, Global Analysis, batch runner usage, and evaluator oversight.

Evaluator role

The full evaluator workbench role. Evaluators can use evaluator-safe test surfaces and their own run/review/history routes. Evaluators cannot see Team Review, Global Analysis, owner/admin tools, or shared platform-wide evidence unless explicitly widened later.

Tester role

The entry-level prompt-testing role. Testers can use Custom Prompt Test and Auto-generated Prompt Test. Testers cannot use Verification Check, Verification Results, Controlled Batch Runner, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.
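
The role entries above imply an access table. The sketch below covers only three surfaces whose access is stated explicitly for every role in this glossary (Controlled Batch Runner, Team Review, Global Analysis). The surface keys and the canAccess helper are hypothetical names. Frontend checks like this do not replace backend enforcement.

```typescript
type Role = "owner" | "admin" | "evaluator" | "tester";
type Surface = "controlled_batch_runner" | "team_review" | "global_analysis";

// Roles allowed on each surface, as stated in the glossary entries.
const ALLOWED_ROLES: Record<Surface, readonly Role[]> = {
  controlled_batch_runner: ["owner", "admin", "evaluator"],
  team_review: ["owner", "admin"],
  global_analysis: ["owner", "admin"],
};

// A missing role (null) is denied everywhere.
function canAccess(role: Role | null, surface: Surface): boolean {
  if (role === null) {
    return false;
  }
  return ALLOWED_ROLES[surface].indexOf(role) !== -1;
}
```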

Run History

The scoped run ledger at:

/lucia/automated/runs

It records completed/finalized run truth and may include scoped operational run state.

Team Review

The owner/admin oversight surface at:

/team-review

Team Review groups evaluator activity, review gaps, flags, recent work, and evidence that needs privileged attention.

Global Analysis

The read-only behavioral and analytics surface at:

/analysis

Global Analysis is owner/admin-only in the current model. It shows AI-analyzed platform evidence, not human Lucia-quality approval. The legacy alias is:

/experiments
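
The two legacy aliases in this glossary (/lucia/automated and /experiments) can be mapped to their canonical routes with an exact-match lookup. This is a sketch, not the app's router: canonicalRoute is a hypothetical helper, and exact matching is an assumption made so that /lucia/automated/runs (Run History) is left untouched.

```typescript
// Legacy alias -> canonical route, as listed in the glossary.
const LEGACY_ALIASES: Record<string, string> = {
  "/lucia/automated": "/lucia/auto-generated",
  "/experiments": "/analysis",
};

// Exact-match lookup only; any other path is returned unchanged.
function canonicalRoute(path: string): string {
  const target: unknown = LEGACY_ALIASES[path];
  return typeof target === "string" ? target : path;
}
```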

Single Run Analysis

The read-only analysis surface for one completed run/session:

/analysis/runs/:sessionId

It can include run metadata, behavioral summaries, item rows, copy controls, and deep links.

localStorage compactness

The client persistence doctrine that completed cloud-backed runs should not persist full item-level payloads in localStorage. The readiness diagnostic target is:

persistedLocalFullPayloadSessionCount = 0
persistedLocalHasItemLevelData = false
persistedLocalItemLevelDataSessionCount = 0
ownedSessionCount = expected run count
otherOwnerSessionCount = 0
ownerlessSessionCount = 0
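
The target values above can be checked mechanically. In this sketch the field names come from the diagnostic target. The CompactnessDiagnostic interface, the compactnessFailures function, and passing the expected run count as a parameter are assumptions for illustration.

```typescript
// Field names mirror the readiness diagnostic target.
interface CompactnessDiagnostic {
  persistedLocalFullPayloadSessionCount: number;
  persistedLocalHasItemLevelData: boolean;
  persistedLocalItemLevelDataSessionCount: number;
  ownedSessionCount: number;
  otherOwnerSessionCount: number;
  ownerlessSessionCount: number;
}

// Returns the names of fields that miss the target (empty when compact).
function compactnessFailures(d: CompactnessDiagnostic, expectedRunCount: number): string[] {
  const failures: string[] = [];
  if (d.persistedLocalFullPayloadSessionCount !== 0) failures.push("persistedLocalFullPayloadSessionCount");
  if (d.persistedLocalHasItemLevelData !== false) failures.push("persistedLocalHasItemLevelData");
  if (d.persistedLocalItemLevelDataSessionCount !== 0) failures.push("persistedLocalItemLevelDataSessionCount");
  if (d.ownedSessionCount !== expectedRunCount) failures.push("ownedSessionCount");
  if (d.otherOwnerSessionCount !== 0) failures.push("otherOwnerSessionCount");
  if (d.ownerlessSessionCount !== 0) failures.push("ownerlessSessionCount");
  return failures;
}
```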

RLS / backend permission enforcement

Supabase row-level security and backend/API permission checks. Frontend role behavior comes from Clerk public metadata. Persisted evidence protection depends on the Clerk session token carrying eval_labs_role so Supabase RLS can recognize privileged owner/admin access. Verify the Clerk-to-Supabase role claim path when role metadata, JWT templates, RLS policies, or privileged evidence hydration changes.

`exportedBy`

The user who exported a session file. Important: this may differ from the person who originally reviewed the prompts.

`savedBy`

The user who saved a specific prompt review. This is more important than exportedBy when auditing human review work.

`savedAt`

The time a review was saved.
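
When auditing who did the review work, attribute each review to savedBy, not exportedBy. The sketch below assumes a hypothetical export shape: the SessionExport and ReviewRecord interfaces, the promptId field, and storing savedAt as a string are illustrative. Only the exportedBy, savedBy, and savedAt field names come from this glossary.

```typescript
// Hypothetical shapes for an exported session file.
interface ReviewRecord {
  promptId: string; // hypothetical identifier
  savedBy: string;
  savedAt: string; // assumed to be a timestamp string
}

interface SessionExport {
  exportedBy: string;
  reviews: ReviewRecord[];
}

// Counts saved reviews per reviewer, ignoring who exported the file.
function reviewCountsBySaver(session: SessionExport): Record<string, number> {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const review of session.reviews) {
    counts[review.savedBy] = (counts[review.savedBy] || 0) + 1;
  }
  return counts;
}
```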

Intent layer

The part of Lucia responsible for interpreting what kind of user message was sent and routing it into the correct behavior mode. If a distress prompt routes to a generic capability redirect, that is usually an intent-layer failure.

Emotional containment

Lucia’s ability to reduce felt chaos without becoming therapy-bot language. Containment means Lucia narrows the field and gives one clear next move.

Trust-state discipline

Lucia’s habit of distinguishing what is known, inferred, suggested, requested, confirmed, or not yet done. A trust-state failure is serious.

Truth-state

The specific truth status of a claim or action. Examples:

known
inferred
suggested
requested
confirmed
not yet done

Truth-state is the thing being preserved. Trust-state discipline is the habit of preserving it.

Regression

A behavior that used to work but breaks after a code, prompt, model, or configuration change. Eval Labs exists largely to detect regressions before they become product damage.

Employee Review

The fast, guided review layer used by non-expert reviewers. Employee Review captures observable human judgment without asking employees to invent labels or taxonomies.

Quick Review

The guided question flow inside the Review Queue. It asks simple questions such as whether Lucia understood the need, gave the right next move, calmed the situation, or created risk.

Adjudication

The senior-review process that assigns final meaning to ambiguous, high-risk, or reusable cases. Adjudication converts review signal into canonical training signal.

Senior reviewer

A reviewer trusted to inspect escalated cases, resolve ambiguity, assign final labels, and decide whether a case should become a canon candidate.

`reviewState`

The workflow state for a prompt review. Current values include:

clean_pass
needs_review
needs_adjudication
canon_candidate

Needs final call

Employee-facing language for a case that needs senior adjudication.
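
One way to keep internal state names out of the employee UI is a small display mapping. This sketch assumes that "Needs final call" corresponds to the needs_adjudication value of reviewState. The glossary gives no employee-facing wording for the other states, so they fall through unchanged. The function name is hypothetical.

```typescript
// Values listed under reviewState in this glossary (the list may not be exhaustive).
const KNOWN_REVIEW_STATES = [
  "clean_pass",
  "needs_review",
  "needs_adjudication",
  "canon_candidate",
] as const;

// Maps a reviewState to employee-facing wording where the glossary defines it.
function employeeFacingLabel(reviewState: string): string {
  if (reviewState === "needs_adjudication") {
    return "Needs final call";
  }
  // No employee-facing wording is documented for other states.
  return reviewState;
}
```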

Canon candidate

A response or failure pattern that may teach Lucia something durable enough to enter Canon, eval suites, or future training guidance.

Ontology drift

The quality failure that happens when reviewers invent inconsistent categories, labels, or meanings over time. Eval Labs prevents ontology drift by separating Employee Review from Adjudication.

Semantic confidence bar

The stepped 1–10 slider used for scoring dimensions. It uses restrained color and fill behavior to help reviewers feel score quality without extra interpretation.
