Chapter · AI

Evaluation

How do you grade a model that can do almost anything? The benchmarks, the methodology, the metrics — and why every claim about model capability deserves scrutiny.

Topics

Topic 1

Evaluation Methodology

What does it actually mean to evaluate a general-purpose model?

Planned

Topic 2

Benchmarks & Benchmaxxing

The standard tests, what they measure, and how they get gamed.

Planned

Topic 3

LLM-as-Judge

Using a strong model to grade other models — and the biases this introduces.

Planned

Topic 4

Metrics

Accuracy, F1, BLEU, perplexity, pass@k — picking the right one for the task.

Planned

Topic 5

Golden Datasets

The hand-curated test sets that ground every other measurement.

Planned

Topic 6

Data Contamination

When the test set leaks into training, and how to detect it.

Planned