LLM Concepts

A narrow, practical tour of how 2026 LLMs actually work, from tokens to benchmarks to local models to personal AI.

22 articles published

1

Welcome to LLM Concepts: How the Machine Behind Your Chatbot Actually Works

A 27-article series on how 2026 LLMs actually work, written for curious readers who use ChatGPT, Claude, or Gemini and want to know what is happening inside.
2

Before the Transformer: A Short History of Machines That Read

Why did the transformer matter so much that we measure AI in 'before' and 'after' it? A short history of every approach that tried and hit a wall first.
3

The Transformer: How Attention Solved the Problem Everything Else Could Not

In 2017, eight researchers replaced the entire approach to language modeling with a single idea: let every word attend to every other word directly.
4

Tokens and Embeddings: How Raw Text Becomes Numbers the Model Can Use

Before the transformer can do anything, it must turn your prompt into numbers. Here is exactly how that works, from raw characters to dense vectors.
5

Positional Encoding and Sampling: How the Transformer Finds Position and Picks Its Next Word

Attention cannot tell 'the dog bit the man' from 'the man bit the dog.' Positional encoding fixes that. Then sampling decides what word the model actually says.
6

Context Windows: Why Your AI Has a Working Memory Limit

Context windows are not memory. They are working memory. Here is what the model can see right now, why extending that limit is hard, and what it costs to try.
7

Parameter Counts and Scaling Laws: What 70B Actually Means

What does 70B actually mean? It tells you about memory requirements, inference speed, and training costs, but almost nothing about model quality on its own.
8

Mixture of Experts: Why 671B Does Not Equal 671B

A 671B Mixture of Experts model can be faster and cheaper to run than a dense 70B. The headline parameter count stopped meaning what it used to mean.
9

Modern Alignment: RLHF, DPO, and Constitutional AI

A base model just predicts tokens. Alignment turns it into an assistant that follows instructions and refuses harmful ones.
10

Reasoning Models: Chain-of-Thought and Test-Time Compute

Reasoning models do not have a new architecture. They have a new training recipe and permission to think for longer before answering.
11

Multimodality: Teaching Models to See and Hear

A multimodal model is not many models in a trench coat. It is one transformer trained to treat pixels, audio, and text as the same kind of thing.
12

The 2026 Model Lineup: Who Ships What

A field guide to the 2026 frontier and open-weight model field, and a practical way to think about which model to actually pick.
13

Hallucinations and Jailbreaks: The Two Ways LLMs Fail

LLMs produce confident wrong answers and can be tricked into ignoring safety rules. What is actually happening and why both failures are hard to fix.
14

Benchmarks: How Labs Measure Intelligence (and the Games They Play)

Every model launch comes with a chart. The numbers look big. What benchmarks actually measure, what they miss, and how labs game them.
15

Interpretability: What's Actually Inside

We can train a 70B model and watch it work. We mostly cannot explain why it works. Interpretability is the science trying to fix that.
16

Prompting and RAG: The Two Levers You Actually Pull

Most teams will never train a model. Most teams will spend a lot of time on prompts and retrieval. What the practical 2026 stack actually looks like.
17

Tool Use, Function Calling, and MCP: How a Chatbot Became an Agent

Tools turn a chatbot into an agent. What function calling actually is, why MCP changed the rules, and the loop that makes a model do work.
18

Fine-tuning with LoRA: When to Change the Model, Not the Prompt

When does fine-tuning beat prompting, and what does LoRA actually cost? The decision ladder that saves most teams from training a model they did not need.
19

Quantization: How a 70B Model Fits on Your Laptop

Quantization shrinks a 70B model from 140 GB to roughly 40 GB with almost no quality loss. What it actually does, and why the trick works.
20

Running Local Models: What It Actually Takes

Quantization shrank the model down to 40 GB. Now what hardware, what software, and what setup actually run a 70B model at home in 2026?
21

Personal AI Infrastructure: The Climb From Chatbot to Assistant

Most AI tools today are chatbots in a costume. The real ladder has three tiers, and 2026 finally made the top one buildable at home.
22

PAI in the Wild: Three Ways to Build Your Own

Three real 2026 options for building a personal AI: Kai (the framework I use), Clawbot (integrations-first), and Hermes (self-improving).

6 more articles coming soon.
