Production AI Institute — vendor-neutral certification for AI practitioners
Foundations · May 2026

What microgpt Reveals About LLMs
200 Lines That Explain Everything

Andrej Karpathy — co-founder of OpenAI, former Director of AI at Tesla — just built a complete GPT with no libraries, no frameworks, and no dependencies. 200 lines of pure Python. Here is what those lines actually tell you about how language models work.

Production AI Institute — Foundations · Based on Karpathy's microgpt (gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95) · Published May 2026 · CC BY 4.0

Why 200 lines matters

Every AI course in the world teaches through abstraction. You use PyTorch. You import transformers. You call functions you do not understand. You build things without knowing how they work. Karpathy's entire career has been a war against that approach.

He previously built micrograd (automatic differentiation from scratch), makemore (character-level language models from scratch), and nanoGPT (a full GPT-2 training run from scratch). Each was a step toward stripping AI down to its mathematical skeleton.

microgpt is the final answer. It trains and runs a GPT model completely from scratch, with no external dependencies, in 200 lines of Python. Karpathy wrote: “This script is the culmination of multiple projects and a decade-long obsession to simplify LLMs to their bare essentials. I cannot simplify this any further.”

What 200 lines contains
A full dataset loader
A tokenizer
An autograd engine that computes gradients
A GPT-2 architecture neural network
The Adam optimizer
A complete training loop
A complete inference loop

This matters even if you will never read the code. Because once you understand what each of those components does — in plain language — you understand what a language model actually is. Not what it does. What it is. That understanding changes how you use it, how you evaluate it, and how you spot when it is going wrong.

The 7 components, explained

Each component is described in plain language with an analogy. The line counts are approximate.

1. The dataset loader

~15 lines

Reads raw text — a book, a web page, a code repository — and converts it into numbers. Language models do not read words. They read numbers that represent words. The dataset loader builds the conversion table.

Analogy
Think of it like a music sampler converting a live performance into a digital file. The sound becomes numbers. The model learns the patterns in the numbers, not the sound itself.
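The conversion table can be sketched in a few lines of plain Python. This character-level version is an illustrative assumption, not microgpt's exact code:

```python
# Minimal sketch of a character-level dataset loader: build a table mapping
# each distinct character to an integer, then encode the text as integers.
def build_loader(text):
    vocab = sorted(set(text))                     # every distinct character
    stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> number
    itos = {i: ch for ch, i in stoi.items()}      # number -> char
    data = [stoi[ch] for ch in text]              # the text, as numbers
    return data, stoi, itos

data, stoi, itos = build_loader("hello world")
# the model only ever sees `data`, a list of integers
```

The key point is the direction of travel: text goes in once, and from then on everything the model touches is numbers.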

2. The tokenizer

~20 lines

Decides how to split text into chunks ('tokens') and assigns each chunk a number. 'hello' might be one token. ' world' might be another. Punctuation gets its own tokens. The tokenizer determines the vocabulary the model can work with.

Analogy
A tokenizer is like a cashier who has to log every purchase. Rather than writing out 'one bottle of shampoo', they scan a barcode. The barcode is the token. The scanner is the tokenizer.
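The barcode idea can be sketched as a toy tokenizer. This word-level version (keeping the leading space, as GPT tokenizers do) is an illustrative assumption; microgpt itself works at the character level:

```python
# Toy tokenizer sketch: split text on spaces and assign each new chunk
# the next free integer. The integers are the "barcodes".
class ToyTokenizer:
    def __init__(self):
        self.vocab = {}  # chunk -> token id

    def encode(self, text):
        chunks = [(" " if i else "") + w for i, w in enumerate(text.split(" "))]
        # setdefault assigns a fresh id the first time a chunk is seen
        return [self.vocab.setdefault(c, len(self.vocab)) for c in chunks]

    def decode(self, ids):
        rev = {i: c for c, i in self.vocab.items()}
        return "".join(rev[i] for i in ids)
```

Encoding "hello world" yields two ids, and decoding them reproduces the original string, which is the whole contract a tokenizer has to honour.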

3. The autograd engine

~30 lines

The mechanism by which the model learns. After making a prediction, autograd calculates how wrong it was and works backwards through the entire network to figure out which numbers to adjust and by how much. This is the heart of machine learning.

Analogy
Imagine you throw a dart and miss. Autograd is the process of figuring out: was it your grip? Your stance? Your release point? And then adjusting each factor by the precise amount that would bring you closer to the target.
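The dart-throwing correction can be sketched as a bare-bones autograd engine in the spirit of Karpathy's micrograd. Each value remembers how it was made, so blame can flow backwards through + and *:

```python
# Bare-bones autograd sketch: each Value records its parents and a small
# chain-rule step; backward() replays those steps in reverse order.
class Value:
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._children, self._grad_fn = children, None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def grad_fn():  # d(out)/d(self) = 1, d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._grad_fn = grad_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def grad_fn():  # d(out)/d(self) = other, d(out)/d(other) = self
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._grad_fn = grad_fn
        return out

    def backward(self):
        order, seen = [], set()
        def visit(v):  # topological sort so parents come after children
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    visit(c)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v._grad_fn:
                v._grad_fn()
```

For z = x*y + x with x=3 and y=4, backward() correctly reports that nudging x changes z five times as fast (y + 1) and nudging y changes it three times as fast (x).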

4. The GPT-2 architecture

~50 lines

The actual neural network structure — the layers, the attention heads, the matrix multiplications that transform an input into a probability distribution over the next token. This is the 'brain' of the model.

Analogy
A GPT-2 architecture is like a very sophisticated autocomplete. But instead of suggesting one word, it assigns a probability to every word in its vocabulary. Then it picks — sometimes the highest-probability word, sometimes not, which is why it can surprise you.
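The final step of that architecture, turning raw scores into a probability for every word, can be sketched with softmax. The logits below are made up for illustration; in a real model they come out of the attention layers and matrix multiplications:

```python
import math

# Softmax: convert raw per-word scores ("logits") into a probability
# distribution that sums to 1.
def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "the", "mat"]
probs = softmax([2.0, 1.0, 0.1, 3.0])
# every word gets a probability, and "mat" is most likely here
```

Sampling from this distribution rather than always taking the top word is exactly why the model can surprise you.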

5. The Adam optimizer

~20 lines

The algorithm that uses the autograd calculations to actually update the model's parameters. Adam is sophisticated: it adjusts how aggressively it updates each parameter based on how often that parameter has been useful.

Analogy
If autograd identifies the problems, Adam is the physical therapist who designs the correction program. Not one exercise for every problem — a tailored approach for each, adjusted based on progress.
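The tailored-per-parameter idea can be sketched as a one-parameter Adam update. The hyperparameters below are the common published defaults, used here as an assumption:

```python
# One-parameter sketch of the Adam update: a running mean of recent gradients
# (momentum) and a running mean of their squares (scale) size each step.
def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # momentum: smoothed gradient
    v = b2 * v + (1 - b2) * grad ** 2      # scale: smoothed squared gradient
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    param -= lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v
```

Because each parameter carries its own m and v, a parameter with consistently large gradients gets gentler steps than one with small, steady gradients, which is the "tailored approach" in the analogy.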

6. The training loop

~25 lines

The cycle: show the model a batch of text, have it predict the next token, calculate how wrong it was, update the parameters. Repeat millions of times. This is how a blank model becomes a model that can write.

Analogy
Training a model is like teaching someone to drive through a simulation. Show scenario. Observe response. Correct errors. Repeat until the responses are reliably good.
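That cycle can be sketched with a deliberately tiny stand-in: a one-parameter "model" learning to multiply by 3. A real GPT runs the same loop with millions of parameters and next-token prediction as the task:

```python
# Toy training loop: predict, measure the error, update, repeat.
w = 0.0                          # the model's single parameter
data = [(1, 3), (2, 6), (3, 9)]  # (input, correct next value)
lr = 0.01                        # learning rate

for step in range(500):          # repeat the cycle many times
    for x, y in data:
        pred = w * x             # 1. show a scenario, observe the response
        grad = 2 * (pred - y) * x  # 2. how wrong, and in which direction
        w -= lr * grad           # 3. correct the error
# w has converged close to 3.0
```

Nothing about the loop changes at scale; only the model inside step 1 and the amount of data do.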

7. The inference loop

~20 lines

How you actually use the trained model. Give it a starting prompt. It predicts the next token. You add that token to the prompt and ask again. Repeat until you have the output you wanted.

Analogy
Inference is autocomplete that never runs out of suggestions. You give it a few words and it completes the sentence. Then the paragraph. Then the document. One token at a time.
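The loop itself can be sketched over a toy "model": a hand-written table of next-word probabilities standing in for the trained network. The table is an assumption; the predict-append-repeat loop is the real mechanism:

```python
import random

# Assumed next-token probabilities, standing in for a trained model.
model = {
    "the": [("cat", 0.6), ("dog", 0.4)],
    "cat": [("sat", 1.0)],
    "dog": [("ran", 1.0)],
    "sat": [("down", 1.0)],
    "ran": [("away", 1.0)],
}

def generate(prompt, steps, rng=random.Random(0)):
    tokens = prompt[:]
    for _ in range(steps):
        choices = model.get(tokens[-1])
        if not choices:            # nothing learned after this token
            break
        words, probs = zip(*choices)
        # sample the next token in proportion to its probability
        tokens.append(rng.choices(words, weights=probs)[0])
    return tokens
```

Each pass through the loop asks one question only: given everything so far, what comes next?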

What this tells us about production AI

Understanding microgpt changes how you think about deploying AI in production — even if you will never write a single line of the training code yourself.

Models are pattern completers, not fact databases
The inference loop works by asking 'what comes next?' — repeatedly. There is no memory of absolute facts. There is only learned statistical association. When a model hallucinates, it is not lying. It is pattern-completing from a direction it was not fully trained for. This is why output validation (PSF D2) is not optional.
Training data is everything
The dataset loader is line one. The model is a compression of its training data. If the training data is biased, incomplete, or stale, the model will be biased, incomplete, or stale. There is no magic that happens after training to correct this. This is why data stewardship (PSF D3) is foundational, not cosmetic.
Confidence is not knowledge
The GPT-2 architecture outputs a probability distribution over the next token. A model that is wrong can still be confident. Confidence in an output tells you about the model's training distribution, not about the truth of the output. This is why human oversight (PSF D6) is a structural requirement, not a temporary workaround.
Every call is stateless
The inference loop starts fresh every time. The model does not remember your previous conversations unless you include them in the context. Memory patterns (RAG, context management) are architectural choices made at the application layer, not at the model layer. This is why assuming a model 'knows' something from a prior session is a common and costly mistake.
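The application-layer fix for statelessness can be sketched as follows. `call_model` is a hypothetical stand-in for any LLM API; the point is that the caller, not the model, carries the history:

```python
# The application keeps the conversation; the model never does.
def call_model(context):
    # hypothetical placeholder for a real LLM call
    return f"(reply to {len(context)} messages)"

history = []
for user_msg in ["What is a token?", "Give me an example."]:
    history.append({"role": "user", "content": user_msg})
    reply = call_model(history)    # the FULL history goes in on every call
    history.append({"role": "assistant", "content": reply})
# without re-sending `history`, the second call would know nothing of the first
```

This is exactly the architectural choice that RAG and context-management patterns make explicit.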

Where to go next

If this article sparked genuine curiosity about how language models work — not at the level of building one from scratch, but at the level of understanding what is actually happening when you prompt one — the AI Foundations track is the right next step.

If you want to go deeper into the code, Karpathy's microgpt is public and free: gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95. You do not need to understand Python to benefit from reading it — the structure alone communicates how the pieces fit together.
