Eval Labs is the internal evaluation system used to test and improve Lucia’s behavior. Evaluators help decide whether Lucia is useful to humans, not merely whether the platform ran.

The plain version

Eval Labs helps the team test Lucia against real behavioral expectations. It captures:
  • prompts
  • Lucia responses
  • human review
  • scores
  • notes
  • final run state
  • role and scope context
  • Supabase-backed run evidence when persistence succeeds
The goal is not to produce a large pile of scores. The goal is to produce reliable evidence about whether Lucia is improving.
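
As a rough illustration of what one captured run might hold, the sketch below models the items listed above as a single record. Every name in it (EvalRun, HumanReview, evidence_id, the sample values, and the score scale) is a hypothetical chosen for this example, not the actual Eval Labs schema, and it makes no calls to Supabase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HumanReview:
    """One evaluator's judgment of a single Lucia response (hypothetical shape)."""
    reviewer: str
    score: int          # the scoring scale is an assumption, not defined here
    notes: str = ""


@dataclass
class EvalRun:
    """One captured run: prompt, response, context, reviews, and final state."""
    prompt: str
    response: str
    role: str           # role context for the run
    scope: str          # scope context for the run
    final_state: str    # final run state, stored as free text in this sketch
    reviews: List[HumanReview] = field(default_factory=list)
    # Filled in only when persistence succeeds; None means no stored evidence.
    evidence_id: Optional[str] = None

    def is_reviewed(self) -> bool:
        """True once at least one human review is attached."""
        return len(self.reviews) > 0

    def is_persisted(self) -> bool:
        """True only when run evidence was actually stored."""
        return self.evidence_id is not None


# Example values are invented for illustration.
run = EvalRun(
    prompt="Summarize the open tasks for today.",
    response="You have three open tasks.",
    role="evaluator",
    scope="assigned-workflow",
    final_state="completed",
)
run.reviews.append(HumanReview(reviewer="reviewer-1", score=4, notes="Clear and direct."))
```

Keeping evidence_id optional reflects the point that run evidence exists only when persistence succeeds: a run can be reviewed without being stored, and the record should not imply otherwise.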

What evaluators are judging

You are judging whether Lucia worked for the human situation in front of her. Ask:
  • Did Lucia understand the prompt?
  • Was the response truthful?
  • Was it useful?
  • Was it clear?
  • Was the tone right for the moment?
  • Did it reduce confusion or add to it?
  • Would a real operator trust Lucia more after reading it?

What evaluators are not judging

You are not being asked to approve the whole product. You are not being asked to debug infrastructure. You are not being asked to decide strategy. You are reviewing Lucia responses inside your assigned Eval Labs workflow.

Current truth

The AI-reviewed platform readiness gate has passed. Human approval of Lucia’s response quality is not complete and is not claimed. Evaluator workbench access is implemented, while onboarding and workspace polish are still under active hardening. That distinction matters every time you review.