Pieter van Noordennen

A PM's Guide to Building with LLMs, Part 3: Why You Need Good Evaluations.

LLM outputs don't have the same level of determinism that you find in traditional software engineering. Here's why it's important to be able to tell the good from the bad. Disclaimer: There will be code.


[Embedded tweet: Greg Brockman on Twitter]

Evaluations (shorthand: “Evals”) are what AI/ML data scientists call the tactics and structures they use to measure how well a given model performs a given task.

At a high level, evaluations are a set of questions with a known, “ground truth” answer that can be verified programmatically.

The ground truth is usually human-labelled data, either drawn from the training data or held out from it (read up on “test-train splits” if you want to know more).

We take the predicted output and compare it to the ground truth output, and voila, we have an accuracy score.
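
To make that concrete, here is a minimal sketch of such an eval loop in Python. Everything in it is illustrative, not the author’s code or any particular library: the example questions, the `get_prediction` stub, and the `exact_match_accuracy` helper are all placeholders.

```python
# A minimal sketch of a classic eval loop: score model predictions
# against ground-truth answers using exact-match accuracy.

eval_set = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]

def get_prediction(question: str) -> str:
    """Stand-in for a call to the model being evaluated."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    return canned[question]

def exact_match_accuracy(examples) -> float:
    """Fraction of examples where the prediction exactly matches the ground truth."""
    correct = sum(
        1 for ex in examples
        if get_prediction(ex["question"]).strip() == ex["answer"]
    )
    return correct / len(examples)

print(f"Accuracy: {exact_match_accuracy(eval_set):.0%}")  # -> Accuracy: 100%
```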

This is trickier in LLM development, where we often aren’t trying to predict a numerical value like a credit score, but rather a human-like answer with varying degrees of nuance and creativity.
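
To see why, consider a hypothetical answer a human would mark correct but a naive string comparison would mark wrong. The `contains_answer` grader below is just one illustrative workaround; real Evals use everything from normalization and fuzzy matching to another LLM acting as a judge.

```python
ground_truth = "Paris"
llm_answer = "The capital of France is Paris."

# Naive exact match fails even though the answer is correct.
print(llm_answer == ground_truth)  # -> False

# One crude but more forgiving grader: normalize case and check containment.
def contains_answer(prediction: str, answer: str) -> bool:
    return answer.lower() in prediction.lower()

print(contains_answer(llm_answer, ground_truth))  # -> True
```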

Enter the world of Evals.

But first…

I’d like to dedicate a full post to the technical underpinnings and practical applications of Evals for product managers.

Before getting into all that, though, I want to talk about why Evals matter — and are critical — in LLM development.