AI learning guide

Best AI resources for evals and reliability

Learn how to build test sets, read traces, write prompt regression tests, and measure quality.

Best practical eval writing guide: LLM Evals. Hamel Husain's guide to writing useful AI evaluations. Start here if you need to measure quality instead of arguing from anecdotes.

Best agent-specific course: Evaluating AI Agents. DeepLearning.AI course focused on testing and improving multi-step agent workflows. Use it when the thing being evaluated calls tools or takes several steps.

Best tracing tool path: Phoenix by Arize. Open-source tracing and evaluation tooling from Arize AI. Use it when you need to inspect what happened during an LLM or agent run.

Evals are how you stop guessing

If an AI feature matters, someone must define what good output looks like. Evals turn that judgement into examples, criteria, traces, graders, and regression checks. Without them, teams end up arguing from anecdotes.
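As a rough sketch of how examples, criteria, a grader, and a regression check fit together, the Python below scores a stand-in system against three hand-written examples. Everything named here (`Example`, `grade`, `pass_rate`, `regression_check`, `fake_generate`, and the keyword criterion itself) is a hypothetical illustration, not an API from any resource listed in this guide. Real graders are usually richer than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    must_include: list[str]  # criteria: phrases a good answer should contain


def grade(output: str, example: Example) -> bool:
    """Hypothetical grader: pass if every required phrase appears (case-insensitive)."""
    text = output.lower()
    return all(phrase.lower() in text for phrase in example.must_include)


def pass_rate(generate: Callable[[str], str], examples: list[Example]) -> float:
    """Run each example through the system and return the fraction that pass."""
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(grade(generate(ex.prompt), ex) for ex in examples)
    return passed / len(examples)


def regression_check(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """The current pass rate must not fall below the baseline minus the tolerance."""
    return current >= baseline - tolerance


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call (assumption for illustration only)."""
    canned = {
        "How do I reset my password?": "Open Settings, then choose Reset password.",
        "What is your refund window?": "We offer refunds within 30 days.",
    }
    return canned.get(prompt, "I am not sure.")


EXAMPLES = [
    Example("How do I reset my password?", ["settings", "reset password"]),
    Example("What is your refund window?", ["30 days"]),
    Example("Do you ship to Canada?", ["canada"]),
]

rate = pass_rate(fake_generate, EXAMPLES)
print(f"pass rate: {rate:.2f}")
print("regression ok:", regression_check(rate, baseline=0.6))
```

Swapping `fake_generate` for a real model call and storing the last accepted pass rate as the baseline turns this into a check you can run on every prompt or model change.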

Hamel Husain's LLM Evals material is the best general starting point. Evaluating AI Agents is better when the workflow has multiple steps. Phoenix helps when you need to inspect traces rather than only score final text.

Measure the workflow, not only the answer

For agents, RAG, coding tools, and research workflows, the final answer is not enough. You also need to check source use, tool calls, intermediate decisions, refusals, clarification behavior, and whether the system stopped at the right time.
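To make that concrete, here is a minimal sketch of checking a recorded run rather than only its final text. The `Step` shape, the step kinds, the tool names, and `check_trace` are all assumptions made up for this example. Tracing tools such as Phoenix have their own trace formats, which this does not try to reproduce.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str       # assumed kinds: "tool_call", "clarify", "refuse", "final"
    name: str = ""  # tool name, only used for tool calls


def check_trace(steps, expected_tools, max_steps):
    """Return a list of failure messages; an empty list means the run passed."""
    failures = []

    # Tool use: every expected tool must have been called at least once.
    called = [s.name for s in steps if s.kind == "tool_call"]
    missing = [t for t in expected_tools if t not in called]
    if missing:
        failures.append(f"missing tool calls: {missing}")

    # Stopping behavior: the run must stay within the step budget...
    if len(steps) > max_steps:
        failures.append(f"took {len(steps)} steps, limit is {max_steps}")

    # ...and must end with a final answer rather than trailing off mid-workflow.
    if not steps or steps[-1].kind != "final":
        failures.append("run did not end with a final answer")

    return failures


good_run = [
    Step("tool_call", "search_docs"),
    Step("tool_call", "fetch_page"),
    Step("final"),
]
looping_run = [Step("tool_call", "search_docs") for _ in range(6)]

good_failures = check_trace(good_run, ["search_docs", "fetch_page"], max_steps=5)
looping_failures = check_trace(looping_run, ["search_docs", "fetch_page"], max_steps=5)

print(good_failures)
for message in looping_failures:
    print(message)
```

The same pattern extends to refusals and clarification behavior: add a check that a "refuse" or "clarify" step appears for the cases where it should.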

A good eval resource should help you build small representative datasets before the feature ships. Waiting until after launch usually turns evaluation into damage control.
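One way to keep a small dataset representative is to sample a fixed number of cases from each category you care about, so rare but important cases are not crowded out. The sketch below assumes candidate cases are dictionaries with hypothetical `category` and `prompt` fields. The category names and `sample_per_category` are illustrative only.

```python
import random
from collections import Counter, defaultdict


def sample_per_category(cases, per_category, seed=0):
    """Pick up to `per_category` cases from each category, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the dataset is stable between runs
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)

    selected = []
    for category in sorted(by_category):
        group = by_category[category]
        k = min(per_category, len(group))  # small categories contribute all they have
        selected.extend(rng.sample(group, k))
    return selected


# Hypothetical candidate cases, e.g. collected from prototypes or internal testing.
CANDIDATES = [
    {"category": "billing", "prompt": "Why was I charged twice?"},
    {"category": "billing", "prompt": "How do I update my card?"},
    {"category": "billing", "prompt": "Can I get an invoice?"},
    {"category": "refusal", "prompt": "Give me another user's address."},
    {"category": "ambiguous", "prompt": "It is not working."},
    {"category": "ambiguous", "prompt": "Can you fix it?"},
]

dataset = sample_per_category(CANDIDATES, per_category=2)
counts = Counter(case["category"] for case in dataset)
print(dict(counts))
```

Capping each category keeps the set small enough to review by hand, and the fixed seed means a score change reflects the system rather than a reshuffled dataset.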

Recommended courses and resources

  1. AI SDK v6 Crash Course

    Workshop · Matt Pocock · Intermediate

    You want a structured AI SDK v6 course that covers model choice, text and object generation, UI streams, agents, persistence, context engineering, evals, and advanced app patterns.

  2. The AI Engineer Roadmap

    Free tutorial · Matt Pocock · Beginner to intermediate

    You want a guided path through core AI concepts, model selection, the AI engineering mindset, evals, and techniques for improving LLM-powered apps.

  3. LLM Evals

    Guide · Hamel Husain · Intermediate

    Your AI app needs quality checks before users see it.

  4. Evaluating AI Agents

    Short course · DeepLearning.AI · Intermediate

    You need to test, trace, and improve agent workflows instead of judging only single LLM responses.

  5. Building and Evaluating Advanced RAG Applications

    Short course · DeepLearning.AI · Intermediate

    You already know basic RAG and need better retrieval, evaluation, and production-quality patterns.

About this guide

Author: Learnetto Editorial Team. Learnetto maintains this AI learning directory by organizing public course pages, official documentation, educator material, and practical learning resources.

How it is made: Learnetto uses public course pages, official documentation, educator material, and directory data to compile these recommendations. AI may help draft and organize the page, but recommendations are checked against the listed sources, page topic, and learner intent.

Review policy: We only add a named personal reviewer when that person has substantially reviewed the page. Until then, the page is attributed to Learnetto rather than a founder, editor, or individual expert.

Last updated: June 18, 2026. Suggest a correction if a course, doc, or recommendation is outdated.

Videos to watch

LLM evaluation with W&B

Weights & Biases

AI evals with Phoenix

Arize AI

Promptfoo red teaming

Promptfoo

Educators and sources

Educator / source | Best for | Skills | Start with
 | Builders shipping LLM systems | Evals, RAG, LLM product quality | Read the evals guide and build a small test set for your own app.
 | Engineers, PMs, AI product teams | Evals, LLM reliability, Product quality | Review the course outcomes and pair it with a real feature you can evaluate.
 | Developers and self-directed learners building with AI coding agents | AI coding, Claude Skills, Agentic workflows, AI SDK, MCP, LLM fundamentals, Personalized learning | Use LLM Fundamentals or the AI Engineer Roadmap if you need concepts, the Vercel AI SDK Tutorial or AI SDK v6 Crash Course if you want to build apps, and the AI Skills catalog if you want practical agent workflows like /teach, /grill-me, /tdd, and /triage.
 | Product managers, AI product leaders, founders | Agentic AI, AI product strategy, Evals, Production AI | Use the course to evaluate one AI product opportunity and define what reliability would mean before implementation.

Resources

AI SDK v6 Crash Course

Workshop · Matt Pocock · Intermediate

You want a structured AI SDK v6 course that covers model choice, text and object generation, UI streams, agents, persistence, context engineering, evals, and advanced app patterns.

The AI Engineer Roadmap

Free tutorial · Matt Pocock · Beginner to intermediate

You want a guided path through core AI concepts, model selection, the AI engineering mindset, evals, and techniques for improving LLM-powered apps.

LLM Evals

Guide · Hamel Husain · Intermediate

Your AI app needs quality checks before users see it.

Evaluating AI Agents

Short course · DeepLearning.AI · Intermediate

You need to test, trace, and improve agent workflows instead of judging only single LLM responses.

OpenAI Working with evals

Guide · OpenAI · Intermediate

You need API-level guidance for testing outputs, comparing models, and catching regressions during upgrades.

OpenAI Evaluate agent workflows

Guide · OpenAI · Intermediate

You need the current OpenAI path for tracing, grading, and regression-testing agent workflows instead of only single-prompt evals.

OpenAI model optimization

Guide · OpenAI · Intermediate

You need a practical optimization loop across prompt changes, evals, and fine-tuning rather than guessing which knob to turn next.

Phoenix by Arize

Open source tool and docs · Arize AI · Intermediate

You need to trace, inspect, and evaluate LLM app behavior.

Langfuse Docs

Docs and cookbooks · Langfuse · Intermediate

You need production LLM tracing, scoring, and prompt operations.

Promptfoo Intro

Open source docs · Promptfoo · Intermediate

You need regression tests for prompts, models, and LLM outputs.

AI Evals for Engineers & PMs

Cohort course · Hamel Husain and Shreya Shankar · Intermediate

You are shipping AI features and need a serious evaluation workflow.

AI Evals for Engineers and PMs

Course · Shreya Shankar · Intermediate

Use this when you want Shreya Shankar's material on evals and related AI skills.

Agentic AI for Product Managers

Maven cohort course · Hamza Farooq · Beginner to intermediate

Use this when you want Hamza Farooq's material on agentic AI and related AI skills.