Best practical guide to writing evals: LLM Evals. Hamel Husain's guide to writing useful AI evaluations. Start here if you need to measure quality instead of arguing from anecdotes.
Best agent-specific course: Evaluating AI Agents. DeepLearning.AI course focused on testing and improving multi-step agent workflows. Use it when the thing being evaluated calls tools or takes several steps.
Best tracing tool: Phoenix by Arize AI. Open-source tracing and evaluation tooling. Use it when you need to inspect what happened during an LLM or agent run.
Evals are how you stop guessing
If an AI feature matters, someone must define what good output looks like. Evals turn that judgement into examples, criteria, traces, graders, and regression checks. Without them, teams end up arguing from anecdotes.
Hamel Husain's LLM Evals material is the best general starting point. Evaluating AI Agents is better when the workflow has multiple steps. Phoenix helps when you need to inspect traces rather than only score final text.
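As a rough illustration of how examples, criteria, a grader, and a regression check fit together, here is a minimal sketch in Python. It is not taken from any of the resources named above. `EvalCase`, `fake_model`, the substring grader, and `BASELINE_PASS_RATE` are all hypothetical names and values chosen for the example; a real setup would call an actual model and would likely use richer graders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One example plus the criteria a good answer must meet (hypothetical shape)."""
    name: str
    prompt: str
    must_include: list[str]


def grade(output: str, case: EvalCase) -> bool:
    """Pass only if every required phrase appears in the output (case-insensitive)."""
    text = output.lower()
    return all(phrase.lower() in text for phrase in case.must_include)


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run each case through the system under test and grade the output."""
    return {case.name: grade(generate(case.prompt), case) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    if "refund" in prompt.lower():
        return "Refunds are available within 30 days of purchase."
    return "I'm not sure."


CASES = [
    EvalCase("refund_window", "What is the refund policy?", ["30 days"]),
    EvalCase("shipping_time", "How long does shipping take?", ["business days"]),
]

# Assumed baseline from a previous run; a regression check fails if we drop below it.
BASELINE_PASS_RATE = 0.5

results = run_evals(fake_model, CASES)
rate = pass_rate(results)
regressed = rate < BASELINE_PASS_RATE
print(results, rate, regressed)
```

The per-case results matter as much as the aggregate rate: a named failing case such as `shipping_time` gives the team something concrete to discuss instead of an anecdote.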
Measure the workflow, not only the answer
For agents, RAG, coding tools, and research workflows, the final answer is not enough. You also need to check source use, tool calls, intermediate decisions, refusals, clarification behavior, and whether the system stopped at the right time.
A good eval resource should help you build small representative datasets before the feature ships. Waiting until after launch usually turns evaluation into damage control.
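Checking the workflow rather than only the final answer can be sketched as assertions over a recorded trace. The example below is a minimal sketch under stated assumptions: `Step`, `Trace`, and `check_trace` are hypothetical structures invented for illustration, not the data model of Phoenix or any other tracing tool, and the three checks cover only tool use and stopping behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded step in an agent run (hypothetical shape)."""
    kind: str          # "tool_call" or "final_answer"
    name: str = ""     # tool name, for tool calls
    content: str = ""  # answer text, for final answers


@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)


def check_trace(trace: Trace, required_tools: list[str], max_steps: int) -> dict[str, bool]:
    """Grade the workflow itself: tool use, step budget, and how the run ended."""
    called = [s.name for s in trace.steps if s.kind == "tool_call"]
    finals = [s for s in trace.steps if s.kind == "final_answer"]
    return {
        # Did the agent call every tool the task requires?
        "used_required_tools": all(tool in called for tool in required_tools),
        # Did it stop within the step budget?
        "stopped_in_time": len(trace.steps) <= max_steps,
        # Did it end with exactly one final answer, as the last step?
        "ended_with_final_answer": len(finals) == 1
        and trace.steps[-1].kind == "final_answer",
    }


good = Trace([
    Step("tool_call", name="search_docs"),
    Step("final_answer", content="The refund window is 30 days."),
])

# An agent that answers without consulting any source.
skipped_lookup = Trace([
    Step("final_answer", content="The refund window is 30 days."),
])

good_report = check_trace(good, required_tools=["search_docs"], max_steps=5)
bad_report = check_trace(skipped_lookup, required_tools=["search_docs"], max_steps=5)
print(good_report)
print(bad_report)
```

Both traces produce the same final text, so an answer-only grader would score them identically. Only the trace-level check shows that the second run skipped the lookup.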
Recommended courses and resources
- AI SDK v6 Crash Course
  Workshop · Matt Pocock · Intermediate
  You want a structured AI SDK v6 course that covers model choice, text and object generation, UI streams, agents, persistence, context engineering, evals, and advanced app patterns.
- The AI Engineer Roadmap
  Free tutorial · Matt Pocock · Beginner to intermediate
  You want a guided path through core AI concepts, model selection, the AI engineering mindset, evals, and techniques for improving LLM-powered apps.
- LLM Evals
  Guide · Hamel Husain · Intermediate
  Your AI app needs quality checks before users see it.
- Evaluating AI Agents
  Short course · DeepLearning.AI · Intermediate
  You need to test, trace, and improve agent workflows instead of judging only single LLM responses.
- Building and Evaluating Advanced RAG Applications
  Short course · DeepLearning.AI · Intermediate
  You already know basic RAG and need better retrieval, evaluation, and production-quality patterns.