Chapter · AI

Safety & Alignment

Widely regarded as one of the hardest unsolved problems in the field. This chapter covers interpretability, red-teaming, jailbreaks, scaling policies, and the open questions about systems we don't yet fully understand.

Topics

Topic 1

The Alignment Problem

Why making powerful systems pursue what we actually want is harder than it sounds.

Planned

Topic 2

Interpretability

Looking inside the network to understand what it's actually doing.

Planned

Topic 3

Sparse Autoencoders

A leading technique for extracting human-readable features from neural activations.

Planned

Topic 4

Red-Teaming

Adversarial testing — finding the failure modes before users do.

Planned

Topic 5

Adversarial Attacks

The mathematical art of crafting inputs that break models.

Planned

Topic 6

Responsible Scaling Policies

Pre-committed safety thresholds tied to capability levels.

Planned

Topic 7

Frontier Safety Frameworks

How labs structure governance over their most capable models.

Planned

Topic 8

Deceptive Alignment & Scheming

The failure mode in which a model passes evaluations without actually being aligned, and what we could do about it.

Planned