Chapter · AI
Safety & Alignment
Arguably the hardest unsolved problem in the field. This chapter covers interpretability, red-teaming, jailbreaks, and scaling policies, along with the open questions about systems we don't yet fully understand.
Topics
Topic 1
The Alignment Problem
Why making powerful systems pursue what we actually want is harder than it sounds.
Topic 2
Interpretability
Looking inside the network to understand what it's actually doing.
Topic 3
Sparse Autoencoders
A leading technique for extracting human-readable features from neural activations.
Topic 4
Red-Teaming
Adversarial testing — finding the failure modes before users do.
Topic 5
Adversarial Attacks
The mathematical art of crafting inputs that break models.
Topic 6
Responsible Scaling Policies
Pre-committed safety thresholds tied to capability levels.
Topic 7
Frontier Safety Frameworks
How labs structure governance over their most capable models.
Topic 8
Deceptive Alignment & Scheming
The failure mode in which a model passes evaluations without actually being aligned, and what we could do about it.