Illustrated flow
Researcher illustration
HAI researcher
User iconUser thread
IntentConstraintsTone
Agent iconAgent thread
PlanValidateRepair
Output iconOutput thread
Next stepChecklistAction
Practitioner lane~6 min

Assistant Evaluation Basics (3-metric core)

Use this to get a fast, defensible quality baseline before optimization.

Metrics

  1. Task success rate — % completed without escalation.
  2. Time to correct outcome — median time from prompt to usable result.
  3. User trust rating — confidence score after each task.

Protocol

  • Pick 3 representative user tasks.
  • Run 5–8 moderated sessions.
  • Capture handoff failures and correction loops.
  • Prioritize fixes by impact on success + trust.

Output template

Task | Success | Time | Trust | Failure mode | Next fix