Tag: alignment
All the articles with the tag "alignment".
- llm-concepts · 7 min read · Interpretability: What's Actually Inside
We can train a 70B model and watch it work. We mostly cannot explain why it works. Interpretability is the science trying to fix that.
- llm-concepts · 7 min read · Modern Alignment: RLHF, DPO, and Constitutional AI
A base model just predicts tokens. Alignment turns it into an assistant that follows instructions and refuses harmful requests.