Evaluation flow

Measure assistant quality before optimizing product behavior.

[Figure: HAI evaluation guide. Scope (tasks, users, context) → Metrics (success, time, trust) → Action (prioritize, fix, re-measure).]
Practitioner lane · ~6 min

Assistant Evaluation Basics (3-metric core)

Use this to get a fast, defensible quality baseline before optimization.

Use this when: you need a quality baseline · You'll leave with: 3 metrics that align teams · Fun cue: run it like a mini lab experiment

Metrics

  1. Task success rate: % of tasks completed without escalation.
  2. Time to correct outcome: median time from prompt to usable result.
  3. User trust rating: confidence score after each task.
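
As a rough illustration, the three metrics above could be computed from per-session records along these lines. This is a minimal sketch, not a prescribed method: the `Session` fields, the 1-5 trust scale, the choice to average trust, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Session:
    task: str
    completed: bool            # participant reached a usable result
    escalated: bool            # needed a human handoff or moderator rescue
    seconds_to_usable: float   # prompt to usable result; ignored if not completed
    trust: int                 # post-task confidence score (assumed 1-5 scale)


def summarize(sessions):
    """Return the three core metrics for a list of Session records."""
    successes = [s for s in sessions if s.completed and not s.escalated]
    times = [s.seconds_to_usable for s in sessions if s.completed]
    return {
        "success_rate": len(successes) / len(sessions),
        "median_seconds": median(times) if times else None,
        "mean_trust": mean(s.trust for s in sessions),
    }


# Made-up sample data for one task, for illustration only.
sessions = [
    Session("reset password", True, False, 40.0, 5),
    Session("reset password", True, True, 95.0, 3),
    Session("reset password", False, True, 0.0, 2),
    Session("reset password", True, False, 60.0, 4),
]
print(summarize(sessions))
```

Whether an escalated but eventually completed task counts toward time to correct outcome is a judgment call; the sketch includes it, so adjust the filter if your team defines it differently.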

Protocol

  • Pick 3 representative user tasks.
  • Run 5–8 moderated sessions.
  • Capture handoff failures and correction loops.
  • Prioritize fixes by impact on success + trust.
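
One possible way to turn the last step into a ranking is sketched below. The guide only says to prioritize by impact on success and trust; the specific ordering rule here (failed sessions first, then the trust shortfall against a baseline) and the names `rank_failure_modes` and `baseline_trust` are assumptions for illustration.

```python
from collections import defaultdict


def rank_failure_modes(observations, baseline_trust):
    """Rank failure modes by failed sessions, then by trust shortfall.

    observations: list of (failure_mode, task_succeeded, trust) tuples,
    one per session in which that failure mode was observed.
    """
    stats = defaultdict(lambda: {"count": 0, "failed": 0, "trust_sum": 0})
    for mode, succeeded, trust in observations:
        entry = stats[mode]
        entry["count"] += 1
        entry["failed"] += 0 if succeeded else 1
        entry["trust_sum"] += trust

    ranked = []
    for mode, entry in stats.items():
        mean_trust = entry["trust_sum"] / entry["count"]
        trust_gap = max(0.0, baseline_trust - mean_trust)
        ranked.append((mode, entry["failed"], round(trust_gap, 2)))

    # Highest impact first: more failed sessions, then larger trust gap.
    ranked.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return ranked


# Made-up observations, for illustration only.
observations = [
    ("handoff failure", False, 2),
    ("handoff failure", False, 3),
    ("correction loop", True, 3),
    ("correction loop", False, 4),
]
print(rank_failure_modes(observations, baseline_trust=4.0))
```

With only 5–8 sessions the counts are small, so treat the ranking as a discussion aid rather than a statistical result.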

Output template

Task | Success | Time | Trust | Failure mode | Next fix
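
If the results are kept in a script or notebook, the template row above can be filled in programmatically. The sketch below assumes each result is a plain dictionary keyed by the template's column names; the `render` helper and the sample row are hypothetical.

```python
COLUMNS = ["Task", "Success", "Time", "Trust", "Failure mode", "Next fix"]


def render(rows):
    """Render result rows as pipe-separated lines under the template header."""
    lines = [" | ".join(COLUMNS)]
    for row in rows:
        lines.append(" | ".join(str(row[column]) for column in COLUMNS))
    return "\n".join(lines)


# Made-up row, for illustration only.
rows = [
    {
        "Task": "reset password",
        "Success": "50%",
        "Time": "60 s",
        "Trust": "3.5",
        "Failure mode": "handoff failure",
        "Next fix": "(to be decided)",
    }
]
print(render(rows))
```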