Tag: alignment

All the articles with the tag "alignment".

llm-concepts
4 May, 2026 7 min read

Interpretability: What's Actually Inside

We can train a 70B model and watch it work. We mostly cannot explain why it works. Interpretability is the science trying to fix that.

llm-concepts
23 Apr, 2026 7 min read

Modern Alignment: RLHF, DPO, and Constitutional AI

A base model just predicts tokens. Alignment turns it into an assistant that follows instructions and refuses harmful ones.