Eval Labs data is designed to preserve prompt context, run source, Lucia response, layered review judgment, lifecycle state, and tester identity.

Session metadata

A session export includes metadata such as:
id
title
mode
runSource
category
subcategory
templateKey
promptCount
status
createdAt
updatedAt
adminBranch
engineBranch
runFailureType
runFailureReason
runFailureAt
reviewLifecycle
remoteRunId
ownerUserId
ownerScopeVersion
localPayloadState
The important field for the custom launcher is:
runSource: custom
Custom and Auto-generated launchers both create run sessions that flow into the same Review Queue. The session mode still describes the run mechanics; runSource distinguishes whether the run came from the Custom launcher or the Auto-generated launcher.
Controlled batch readiness runs also rely on the same run/session lifecycle, but their product meaning is different: they are platform readiness evidence, not normal evaluator-facing tests.
Registry Diagnostics reads existing session/run/review evidence and derives dataset membership and review-lane suggestions from it. Those suggestions are diagnostic output, not saved labels. Behavioral Observatory labels are separate persisted records, created only when saved through the Behavioral Observatory flow.

Cases

Each prompt becomes a case. A case contains:
id
sessionId
orderIndex
sourceType
title
promptText
promptLocked
luciaResponse
runStatus
runError
category
subcategory
templateKey
createdAt
updatedAt
Order matters. The exported orderIndex must remain stable.
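A minimal sketch of an order check, assuming the session shape used in the export example on this page (a caseOrder list plus cases keyed by id) and assuming orderIndex is zero-based and equal to the case's position in caseOrder. The function name check_case_order and the sample data are hypothetical, not part of Eval Labs.

```python
def check_case_order(session: dict) -> list:
    """Return human-readable problems; an empty list means the order is consistent."""
    problems = []
    cases = session.get("cases", {})
    for position, case_id in enumerate(session.get("caseOrder", [])):
        case = cases.get(case_id)
        if case is None:
            problems.append(f"{case_id}: listed in caseOrder but missing from cases")
        elif case.get("orderIndex") != position:
            # Assumption: orderIndex mirrors the zero-based position in caseOrder.
            problems.append(
                f"{case_id}: orderIndex {case.get('orderIndex')} != position {position}"
            )
    return problems


# Illustrative session fragment, not real export data.
example_session = {
    "caseOrder": ["case-001", "case-002"],
    "cases": {
        "case-001": {"orderIndex": 0},
        "case-002": {"orderIndex": 1},
    },
}
print(check_case_order(example_session))
```

Running a check like this before and after a re-export is one way to confirm that the order did not drift.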

Prompt results

Prompt results include:
  • draft review state, including reviewer input and suggestions
  • saved review state
  • saved timestamp
  • saved-by tester identity
  • completion/dirty state derived from saved vs draft review
A generated-but-unreviewed item may have null ratings and savedBy: null. That is expected. A saved review should include savedBy.

exportedBy vs savedBy

exportedBy identifies who exported the file. savedBy identifies who reviewed/saved the individual prompt. This distinction matters because one person may export a run that another person reviewed.
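To make the distinction concrete, the sketch below lists the cases whose saved review came from someone other than the exporter. It assumes the export layout shown in the example on this page; cases_reviewed_by_others and the user ids are hypothetical.

```python
def cases_reviewed_by_others(export: dict) -> list:
    """Return case ids whose savedBy identity differs from exportedBy."""
    exporter_id = (export.get("exportedBy") or {}).get("clerkUserId")
    prompt_results = export.get("session", {}).get("promptResults", {})
    reviewed_by_others = []
    for case_id, result in prompt_results.items():
        saved_by = result.get("savedBy")
        if saved_by is None:
            continue  # not reviewed yet, so there is no reviewer to compare
        if saved_by.get("clerkUserId") != exporter_id:
            reviewed_by_others.append(case_id)
    return reviewed_by_others


# Illustrative data only; the ids below are made up.
example_export = {
    "exportedBy": {"clerkUserId": "user_exporter"},
    "session": {
        "promptResults": {
            "case-001": {"savedBy": {"clerkUserId": "user_reviewer"}},
            "case-002": {"savedBy": {"clerkUserId": "user_exporter"}},
            "case-003": {"savedBy": None},
        }
    },
}
print(cases_reviewed_by_others(example_export))
```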

Tester identity fields

Eval Labs stores only limited identity fields:
clerkUserId
email
name
No unnecessary Clerk metadata should be stored. Role gating reads Clerk public metadata through eval_labs_role, but role metadata is product access state, not review authorship.
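One way to enforce the limited identity shape is an allow-list applied before anything is stored. This is a sketch only: to_tester_identity and the shape of the input dictionary are assumptions for illustration, not the actual Eval Labs or Clerk code path.

```python
# The three identity fields named above.
ALLOWED_IDENTITY_FIELDS = ("clerkUserId", "email", "name")


def to_tester_identity(user: dict) -> dict:
    """Keep only the allowed identity fields and drop everything else."""
    return {field: user.get(field) for field in ALLOWED_IDENTITY_FIELDS}


# Hypothetical input: a user object that carries extra metadata.
raw_user = {
    "clerkUserId": "user_123",
    "email": "tester@example.com",
    "name": "Example Tester",
    "publicMetadata": {"eval_labs_role": "evaluator"},  # access state, not authorship
}
print(to_tester_identity(raw_user))
```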

Nulls in exports

Some nulls are normal. Expected nulls include:
runFailureType: null
runFailureReason: null
runFailureAt: null
savedBy: null when not reviewed yet
ratings: null when not scored yet
Do not treat every null as a bug. Treat nulls as suspicious only when the workflow step should have populated them.
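As an illustration of that rule, the sketch below flags only one case: a prompt result that has a saved review but no savedBy, since a saved review should include savedBy. It assumes the promptResults shape from the export example; suspicious_nulls is a hypothetical helper and does not cover every workflow step.

```python
def suspicious_nulls(prompt_results: dict) -> list:
    """Return case ids where a saved review exists but savedBy is null."""
    flagged = []
    for case_id, result in prompt_results.items():
        has_saved_review = result.get("saved") is not None
        if has_saved_review and result.get("savedBy") is None:
            flagged.append(case_id)
    return flagged


# Illustrative data only.
example_results = {
    "case-001": {"saved": None, "savedBy": None},  # unreviewed: nulls are expected
    "case-002": {"saved": {"ratings": None}, "savedBy": None},  # saved but unattributed
    "case-003": {"saved": {"ratings": None}, "savedBy": {"clerkUserId": "user_a"}},
}
print(suspicious_nulls(example_results))
```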

Export options and example

Figure: Export controls for easy usability with multiple data formats.
{
  "format": "lucia-eval-lab-session/v0.3",
  "exportedAt": "2026-04-29T19:33:10.000Z",
  "exportedBy": {
    "clerkUserId": "user_3D2BItLYUO1uqJOqzlZTvHZNgsF",
    "email": "aviv@hellolucia.ai",
    "name": "Aviv Hadar"
  },
  "session": {
    "metadata": {
      "id": "session-example",
      "runSource": "custom",
      "status": "ready",
      "reviewLifecycle": {
        "status": "in_review",
        "finalizedAt": null,
        "finalizedBy": null
      }
    },
    "caseOrder": ["case-001"],
    "cases": {
      "case-001": {
        "orderIndex": 0,
        "promptText": "I'm spinning a little. Tell me what to do first so I can breathe again.",
        "luciaResponse": "Take a breath. This feels heavier than it is. Nothing critical is slipping beyond the first move.",
        "runStatus": "success"
      }
    },
    "promptResults": {
      "case-001": {
        "draft": {},
        "saved": null,
        "savedAt": null,
        "savedBy": null
      }
    }
  }
}
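A short sketch of reading an export in the shape above: it parses the JSON and walks cases in caseOrder, reporting whether each prompt has a saved review. The helper iter_cases_in_order is hypothetical, and the embedded JSON is a trimmed copy of the example rather than a full export.

```python
import json


def iter_cases_in_order(export: dict):
    """Yield (case_id, prompt_text, is_reviewed) following caseOrder."""
    session = export["session"]
    for case_id in session["caseOrder"]:
        case = session["cases"][case_id]
        result = session["promptResults"].get(case_id, {})
        yield case_id, case["promptText"], result.get("saved") is not None


# Trimmed copy of the example export above.
raw = """
{
  "format": "lucia-eval-lab-session/v0.3",
  "session": {
    "caseOrder": ["case-001"],
    "cases": {
      "case-001": {
        "orderIndex": 0,
        "promptText": "I'm spinning a little. Tell me what to do first so I can breathe again.",
        "runStatus": "success"
      }
    },
    "promptResults": {
      "case-001": {"draft": {}, "saved": null, "savedAt": null, "savedBy": null}
    }
  }
}
"""

rows = list(iter_cases_in_order(json.loads(raw)))
for case_id, prompt_text, is_reviewed in rows:
    print(case_id, "reviewed" if is_reviewed else "unreviewed")
```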

Review-layer fields

Prompt reviews now support these additional fields:
ratings
suggestedRatings
keepTalking
suggestedKeepTalking
status
suggestedStatus
priority
suggestedPriority
reviewState
luciaPredictedLabels
humanLabels
adjudication
employeeReview
suggestedEmployeeReview
humanGuidanceEval
suggestedHumanGuidanceEval
canonCandidate
The suggested fields are product suggestions, not final reviewer judgment. They are generated from prompt/response/run-status heuristics and remain separate from the reviewer-saved values. Suggested fields may feed derived context in Registry Diagnostics or Behavioral Observatory, but they do not become persisted Behavioral Observatory labels unless a reviewer saves a label in the Behavioral Observatory surface.
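Because suggested and reviewer-saved values stay separate, they can be compared directly. The sketch below pairs each suggested field with its reviewer counterpart by name (suggestedRatings with ratings, and so on) and reports where the reviewer chose something different. The name-based pairing, the function suggestion_overrides, and the sample values are assumptions for illustration.

```python
PREFIX = "suggested"


def suggestion_overrides(review: dict) -> dict:
    """Map reviewer field name -> suggested and saved values where they differ."""
    overrides = {}
    for key, suggested_value in review.items():
        if not key.startswith(PREFIX) or len(key) == len(PREFIX):
            continue
        base = key[len(PREFIX):]
        reviewer_key = base[0].lower() + base[1:]  # suggestedRatings -> ratings
        saved_value = review.get(reviewer_key)
        # A null reviewer value means "not decided yet", not an override.
        if saved_value is not None and saved_value != suggested_value:
            overrides[reviewer_key] = {"suggested": suggested_value, "saved": saved_value}
    return overrides


# Illustrative review record; the values are made up.
example_review = {
    "ratings": {"tone": 4},
    "suggestedRatings": {"tone": 3},
    "keepTalking": True,
    "suggestedKeepTalking": True,
    "priority": None,
    "suggestedPriority": "high",
}
print(suggestion_overrides(example_review))
```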

Employee Review object

Employee Review captures guided non-expert signal:
understoodNeed
rightNextMove
calmingEffect
riskOrConfusion
seniorReview
reusableLearning
These fields are intentionally simple and should remain employee-friendly.

Human Guidance Evaluation object

Human Guidance Evaluation captures a 1-5 review layer:
emotionalValidation
cognitiveUnderstanding
actionability
toneAppropriateness
authenticity
notes
Warmth and intelligence are not separate export fields. They are expressed through the current scoring dimensions and guidance fields: tone, calming, naturalness, trust, usefulness, cognitiveUnderstanding, actionability, and authenticity.
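A minimal range check for the 1-5 layer, assuming the five scored dimensions hold integers, notes is unscored, and a null score means "not scored yet". The helper invalid_scores is hypothetical.

```python
SCORED_DIMENSIONS = (
    "emotionalValidation",
    "cognitiveUnderstanding",
    "actionability",
    "toneAppropriateness",
    "authenticity",
)


def invalid_scores(evaluation: dict) -> list:
    """Return the scored dimensions whose value is present but outside 1-5."""
    bad = []
    for dimension in SCORED_DIMENSIONS:
        score = evaluation.get(dimension)
        if score is None:
            continue  # not scored yet
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 1 <= score <= 5:
            bad.append(dimension)
    return bad


# Illustrative evaluation; the values are made up.
example_eval = {
    "emotionalValidation": 5,
    "cognitiveUnderstanding": 0,
    "actionability": None,
    "toneAppropriateness": 6,
    "authenticity": 3,
    "notes": "Felt rushed near the end.",
}
print(invalid_scores(example_eval))
```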

Adjudication object

Adjudication captures final senior-review meaning when it exists in the review record:
finalLabels
reason
adjudicator
adjudicatedAt
Final labels may include:
guestIntent
followThroughRequired
actionType
emotionalRead
ownerStressLevel

Review lifecycle object

Run lifecycle finalization is stored at the session level:
status: in_review | ready_to_finalize | finalized
finalizedAt
finalizedBy
Finalization does not replace per-prompt review data. It marks the run lifecycle after all prompts are reviewed.
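The lifecycle object can be sanity-checked with a small consistency rule. This sketch assumes that finalizedAt and finalizedBy are set only when status is finalized, which matches the in_review example on this page but is not stated as a rule in the source. lifecycle_problems is a hypothetical helper.

```python
VALID_STATUSES = {"in_review", "ready_to_finalize", "finalized"}


def lifecycle_problems(lifecycle: dict) -> list:
    """Return consistency problems found in a reviewLifecycle object."""
    problems = []
    status = lifecycle.get("status")
    finalized_at = lifecycle.get("finalizedAt")
    finalized_by = lifecycle.get("finalizedBy")
    if status not in VALID_STATUSES:
        problems.append(f"unknown status: {status!r}")
    if status == "finalized" and (finalized_at is None or finalized_by is None):
        problems.append("finalized run is missing finalizedAt or finalizedBy")
    # Assumption: runs that are not finalized carry no finalization stamp.
    if status != "finalized" and (finalized_at is not None or finalized_by is not None):
        problems.append("non-finalized run carries finalization data")
    return problems


print(lifecycle_problems({"status": "in_review", "finalizedAt": None, "finalizedBy": None}))
```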

Supabase persistence contract

The Supabase persistence layer stores the run and item contract across three tables (eval_runs, eval_run_items, and eval_item_reviews), in these locations:
eval_runs.metadata.reviewLifecycle
eval_runs.metadata.metadata
eval_run_items.payload.case
eval_run_items.payload.promptRecord
eval_item_reviews
eval_run_items.payload.promptRecord embeds the full prompt review record, including saved/draft state, suggested review, employee review, Human Guidance Evaluation, adjudication metadata, canon candidate signal, tester identity, and dirty/completion state.
eval_item_reviews is still written for review rows, but hydration prefers the embedded eval_run_items.payload.promptRecord instead of relying on fragile review-table reads.
Behavioral Observatory labels are stored separately:
public.eval_behavioral_labels
This table stores first-class Behavioral Observatory labels:
run_id
run_item_id
owner_user_id
reviewer_user_id
intent
guest_affect
response_strategy
humanness
notes
status
payload
created_at
updated_at
The key distinction:
eval_item_reviews = Review Queue review evidence
eval_behavioral_labels = Behavioral Observatory label evidence
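The hydration preference for eval_run_items.payload.promptRecord described above can be sketched as a lookup that tries the embedded record first. Rows are modeled as plain dictionaries. The fallback to review rows, and the run_item_id and payload keys on those rows, are assumptions made for illustration: the source states only that the embedded record is preferred.

```python
from typing import Optional


def hydrate_prompt_record(run_item: dict, review_rows: list) -> Optional[dict]:
    """Prefer the embedded promptRecord; fall back to a matching review row."""
    embedded = (run_item.get("payload") or {}).get("promptRecord")
    if embedded is not None:
        return embedded
    # Assumed fallback: scan review rows for this run item.
    for row in review_rows:
        if row.get("run_item_id") == run_item.get("id"):
            return row.get("payload")
    return None


# Illustrative rows only; real column layouts may differ.
item_with_record = {"id": "item-1", "payload": {"promptRecord": {"savedBy": None}}}
item_without_record = {"id": "item-2", "payload": {"case": {}}}
reviews = [{"run_item_id": "item-2", "payload": {"savedBy": {"clerkUserId": "user_a"}}}]

print(hydrate_prompt_record(item_with_record, reviews))
print(hydrate_prompt_record(item_without_record, reviews))
```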
One saved Behavioral Observatory label exists per reviewer per run item.
Current persisted run evidence is scoped by the signed-in Clerk user and the role claim available to Supabase RLS. Owner/admin can inspect shared persisted evidence where privileged RLS allows it. Evaluator and tester data remains scoped to their own work except where owner/admin oversight applies.
Current readiness verification checks counts across:
public.eval_runs
public.eval_run_items
public.eval_item_reviews
For the 60-run readiness gate, the final verified result was:
ready | 60 | 3000 | 3000 | 3000 | 3000
Meaning:
  • 60 ready runs
  • 3,000 expected prompts
  • 3,000 run items
  • 3,000 non-empty Lucia responses
  • 3,000 reviews for the tested reviewer id
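The gate result above can be read as a simple equality check: every per-item count should match the expected prompt count. The sketch below encodes the verified row with hypothetical key names (the source gives only the positional values) and checks it. Dividing 3,000 expected prompts by 60 runs gives an average of 50 prompts per run.

```python
def readiness_ok(row: dict) -> bool:
    """True when the run is ready and every count matches the expected prompts."""
    expected = row["expected_prompts"]
    return (
        row["status"] == "ready"
        and row["run_items"] == expected
        and row["responses"] == expected
        and row["reviews"] == expected
    )


# The verified 60-run result; key names are illustrative labels for the columns.
verified = {
    "status": "ready",
    "runs": 60,
    "expected_prompts": 3000,
    "run_items": 3000,
    "responses": 3000,
    "reviews": 3000,
}

average_prompts_per_run = verified["expected_prompts"] // verified["runs"]
print(readiness_ok(verified), average_prompts_per_run)
```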

localStorage compaction contract

Completed cloud-backed runs should not leave full item-level payloads persisted in localStorage. The platform-readiness diagnostic target is:
persistedLocalFullPayloadSessionCount = 0
persistedLocalHasItemLevelData = false
persistedLocalItemLevelDataSessionCount = 0
ownedSessionCount = expected run count
otherOwnerSessionCount = 0
ownerlessSessionCount = 0
The final verified 60-run diagnostic was:
sessionCount = 60
persistedLocalFullPayloadSessionCount = 0
persistedLocalHasItemLevelData = false
persistedLocalItemLevelDataSessionCount = 0
ownedSessionCount = 60
otherOwnerSessionCount = 0
ownerlessSessionCount = 0
rawByteSize ≈ 68,815
This supports platform readiness and client compactness. It does not prove backend authorization is complete.
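A sketch of comparing a diagnostic snapshot against the targets listed above. The diagnostic is modeled as a plain dictionary and compaction_failures is a hypothetical helper. Values are compared by type as well as value, so a count of 0 is not mistaken for the boolean false.

```python
def compaction_failures(diagnostic: dict, expected_run_count: int) -> list:
    """Return the diagnostic keys that miss the compaction targets."""
    targets = {
        "persistedLocalFullPayloadSessionCount": 0,
        "persistedLocalHasItemLevelData": False,
        "persistedLocalItemLevelDataSessionCount": 0,
        "ownedSessionCount": expected_run_count,
        "otherOwnerSessionCount": 0,
        "ownerlessSessionCount": 0,
    }
    failures = []
    for key, want in targets.items():
        got = diagnostic.get(key)
        if type(got) is not type(want) or got != want:
            failures.append(key)
    return failures


# The verified 60-run diagnostic from above (rawByteSize omitted; it is approximate).
verified_diagnostic = {
    "sessionCount": 60,
    "persistedLocalFullPayloadSessionCount": 0,
    "persistedLocalHasItemLevelData": False,
    "persistedLocalItemLevelDataSessionCount": 0,
    "ownedSessionCount": 60,
    "otherOwnerSessionCount": 0,
    "ownerlessSessionCount": 0,
}
print(compaction_failures(verified_diagnostic, expected_run_count=60))
```

As the section notes, passing this check says nothing about backend authorization; it only covers what the client keeps locally.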

Export rule

Exports should preserve the full review contract:
employeeReview = what the reviewer experienced
suggestedEmployeeReview = what the app suggested
humanGuidanceEval = structured 1-5 human guidance
suggestedHumanGuidanceEval = app-suggested human guidance
adjudication = final senior-review metadata when present
reviewLifecycle = whether the run is still in review or finalized
Do not collapse these into one field.