Hacker News · Mar 1, 2026 · Collected from RSS
Article URL: http://karpathy.github.io/2026/02/12/microgpt/
Comments URL: https://news.ycombinator.com/item?id=47202708
Points: 225
# Comments: 25
This is a brief guide to my new art project microgpt, a single file of 200 lines of pure Python with no dependencies that trains and runs inference on a GPT. This file contains the full algorithmic content of what is needed: dataset of documents, tokenizer, autograd engine, a GPT-2-like neural network architecture, the Adam optimizer, training loop, and inference loop. Everything else is just efficiency. I cannot simplify this any further. This script is the culmination of multiple projects (micrograd, makemore, nanogpt, etc.) and a decade-long obsession to simplify LLMs to their bare essentials, and I think it is beautiful 🥹. It even breaks perfectly across 3 columns.

Where to find it:

- This GitHub gist has the full source code: microgpt.py
- It’s also available on this web page: https://karpathy.ai/microgpt.html
- Also available as a Google Colab notebook

The following is my guide on stepping an interested reader through the code.

Dataset

The fuel of large language models is a stream of text data, optionally separated into a set of documents. In production-grade applications, each document would be an internet web page, but for microgpt we use a simpler example of 32,000 names, one per line:

    import os      # os.path.exists
    import random  # random.shuffle

    # Let there be an input dataset `docs`: list[str] of documents (e.g. a dataset of names)
    if not os.path.exists('input.txt'):
        import urllib.request
        names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
        urllib.request.urlretrieve(names_url, 'input.txt')
    docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()] # list[str] of documents
    random.shuffle(docs)
    print(f"num docs: {len(docs)}")

The dataset looks like this. Each name is a document:

    emma
    olivia
    ava
    isabella
    sophia
    charlotte
    mia
    amelia
    harper
    ... (~32,000 names follow)

The goal of the model is to learn the patterns in the data and then generate similar new documents that share the statistical patterns within.
As a preview, by the end of the script our model will generate (“hallucinate”!) new, plausible-sounding names. Skipping ahead, we’ll get:

    sample 1: kamon
    sample 2: ann
    sample 3: karai
    sample 4: jaire
    sample 5: vialan
    sample 6: karia
    sample 7: yeran
    sample 8: anna
    sample 9: areli
    sample 10: kaina
    sample 11: konna
    sample 12: keylen
    sample 13: liole
    sample 14: alerin
    sample 15: earan
    sample 16: lenne
    sample 17: kana
    sample 18: lara
    sample 19: alela
    sample 20: anton

It doesn’t look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny looking “document”. When you initialize the document with your prompt, the model’s response from its perspective is just a statistical document completion.

Tokenizer

Under the hood, neural networks work with numbers, not characters, so we need a way to convert text into a sequence of integer token ids and back. Production tokenizers like tiktoken (used by GPT-4) operate on chunks of characters for efficiency, but the simplest possible tokenizer just assigns one integer to each unique character in the dataset:

    # Let there be a Tokenizer to translate strings to discrete symbols and back
    uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
    BOS = len(uchars) # token id for the special Beginning of Sequence (BOS) token
    vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
    print(f"vocab size: {vocab_size}")

In the code above, we collect all unique characters across the dataset (which are just all the lowercase letters a-z), sort them, and each letter gets an id by its index. Note that the integer values themselves have no meaning at all; each token is just a separate discrete symbol. Instead of 0, 1, 2 they might as well be different emoji. In addition, we create one more special token called BOS (Beginning of Sequence), which acts as a delimiter: it tells the model “a new document starts/ends here”.
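To make the "text to token ids and back" round trip concrete, here is a minimal sketch on a three-name stand-in dataset. The `stoi` table and the `encode`/`decode` helpers are hypothetical names introduced only for illustration; they are not taken from microgpt, which may do this conversion differently. The BOS wrapping mirrors how the guide describes documents being delimited.

```python
# Tiny stand-in dataset; the real script reads ~32,000 names from input.txt.
docs = ["emma", "olivia", "ava"]

uchars = sorted(set(''.join(docs)))  # unique characters become token ids 0..n-1
BOS = len(uchars)                    # token id for the special delimiter token
vocab_size = len(uchars) + 1         # +1 is for BOS

# Hypothetical helper table (not part of microgpt): character -> token id.
stoi = {ch: i for i, ch in enumerate(uchars)}

def encode(doc):
    """Map a string to token ids, wrapped with BOS on both sides."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    """Map token ids back to a string, dropping the BOS delimiters."""
    return ''.join(uchars[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(tokens)          # [7, 1, 4, 4, 0, 7]
print(decode(tokens))  # emma
```

With only seven distinct characters in this toy dataset, BOS gets id 7. The ids carry no meaning beyond "which symbol is this", so a different dataset would give different numbers for the same letters.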
Later during training, each document gets wrapped with BOS on both sides: [BOS, e, m, m, a, BOS]. The model learns that BOS initiates a new name, and that another BOS ends it. Therefore, we have a final vocabulary of 27 (26 possible lowercase characters a-z and +1 for the BOS token).

Autograd

Training a neural network requires gradients: for each parameter in the model, we need to know “if I nudge this number up a little, does the loss go up or down, and by how much?”. The computation graph has many inputs (the model parameters and the input tokens) but funnels down to a single scalar output: the loss (we’ll define exactly what the loss is below). Backpropagation starts at that single output and works backwards through the graph, computing the gradient of the loss with respect to every input. It relies on the chain rule from calculus. In production, libraries like PyTorch handle this automatically. Here, we implement it from scratch in a single class called Value:

    import math  # math.log, math.exp

    class Value:
        __slots__ = ('data', 'grad', '_children', '_local_grads')

        def __init__(self, data, children=(), local_grads=()):
            self.data = data                # scalar value of this node calculated during forward pass
            self.grad = 0                   # derivative of the loss w.r.t. this node, calculated in backward pass
            self._children = children       # children of this node in the computation graph
            self._local_grads = local_grads # local derivative of this node w.r.t. its children

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data + other.data, (self, other), (1, 1))

        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data * other.data, (self, other), (other.data, self.data))

        def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
        def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
        def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
        def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
        def __neg__(self): return self * -1
        def __radd__(self, other): return self + other
        def __sub__(self, other): return self + (-other)
        def __rsub__(self, other): return other + (-self)
        def __rmul__(self, other): return self * other
        def __truediv__(self, other): return self * other**-1
        def __rtruediv__(self, other): return other * self**-1

        def backward(self):
            # Topologically sort the graph so each node is processed after everything that depends on it
            topo = []
            visited = set()
            def build_topo(v):
                if v not in visited:
                    visited.add(v)
                    for child in v._children:
                        build_topo(child)
                    topo.append(v)
            build_topo(self)
            self.grad = 1
            for v in reversed(topo):
                for child, local_grad in zip(v._children, v._local_grads):
                    child.grad += local_grad * v.grad

I realize that this is the most mathematically and algorithmically intense part and I have a 2.5 hour video on it: micrograd video. Briefly, a Value wraps a single scalar number (.data) and tracks how it was computed. Think of each operation as a little lego block: it takes some inputs, produces an output (the forward pass), and it knows how its output would change with respect to each of its inputs (the local gradient). That’s all the information autograd needs from each block. Everything else is just the chain rule, stringing the blocks together.
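One way to convince yourself that each block's local gradient is right is to take the "nudge this number up a little" idea literally and compare against a finite difference. The sketch below does that with plain floats for the six operations defined on Value. The names `numeric_grad`, `h`, and `blocks` are illustrative assumptions of mine and do not appear in microgpt.

```python
import math

def numeric_grad(f, x, h=1e-6):
    """Central finite difference: change in f per unit nudge of x."""
    return (f(x + h) - f(x - h)) / (2 * h)

a, b, n = 2.0, 3.0, 3

# (operation, forward pass as a function of a, analytic local gradient w.r.t. a)
blocks = [
    ("a + b",   lambda x: x + b,       1.0),
    ("a * b",   lambda x: x * b,       b),
    ("a ** n",  lambda x: x ** n,      n * a ** (n - 1)),
    ("log(a)",  lambda x: math.log(x), 1 / a),
    ("exp(a)",  lambda x: math.exp(x), math.exp(a)),
    ("relu(a)", lambda x: max(0.0, x), float(a > 0)),
]

max_err = 0.0
for name, f, analytic in blocks:
    numeric = numeric_grad(f, a)
    max_err = max(max_err, abs(numeric - analytic))
    print(f"{name:8s} analytic={analytic:.6f} numeric={numeric:.6f}")

print(f"largest disagreement: {max_err:.2e}")
```

The two columns should agree to several decimal places. A check like this only works away from kinks, so relu is tested at a = 2 rather than at 0, where its derivative is not defined.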
Every time you do math with Value objects (add, multiply, etc.), the result is a new Value that remembers its inputs (_children) and the local derivative of that operation (_local_grads). For example, __mul__ records that \(\frac{\partial(a \cdot b)}{\partial a} = b\) and \(\frac{\partial(a \cdot b)}{\partial b} = a\). The full set of lego blocks:

| Operation | Forward | Local gradients |
| --- | --- | --- |
| a + b | \(a + b\) | \(\frac{\partial}{\partial a} = 1, \quad \frac{\partial}{\partial b} = 1\) |
| a * b | \(a \cdot b\) | \(\frac{\partial}{\partial a} = b, \quad \frac{\partial}{\partial b} = a\) |
| a ** n | \(a^n\) | \(\frac{\partial}{\partial a} = n \cdot a^{n-1}\) |
| log(a) | \(\ln(a)\) | \(\frac{\partial}{\partial a} = \frac{1}{a}\) |
| exp(a) | \(e^a\) | \(\frac{\partial}{\partial a} = e^a\) |
| relu(a) | \(\max(0, a)\) | \(\frac{\partial}{\partial a} = \mathbf{1}_{a > 0}\) |

The backward() method walks this graph in reverse topological order (starting from the loss, ending at the parameters), applying the chain rule at each step. If the loss is \(L\) and a node \(v\) has a child \(c\) with local gradient \(\frac{\partial v}{\partial c}\), then:

\[\frac{\partial L}{\partial c} \mathrel{+}= \frac{\partial v}{\partial c} \cdot \frac{\partial L}{\partial v}\]

This looks a bit scary if you’re not comfortable with your calculus, but this is literally just multiplying two numbers in an intuitive way. One way to see it looks as follows: “If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 x 4 = 8 times as fast as the man.” The chain rule is the same idea: you multiply the rates of change along the path.

We kick things off by setting self.grad = 1 at the loss node, because \(\frac{\partial L}{\partial L} = 1\): the loss’s rate of change with respect to itself is trivially 1. From there, the chain rule just multiplies local gradients along every path back to the parameters.

Note the += (accumulation, not assignment).
When a value is used in multiple places in the graph (i.e. the graph branches), gradients flow back along each branch independently and must be summed. This is a consequence of the multivariable chain rule: if \(c\) contributes to \(L\) through multiple paths, the total derivative is the sum of contributions from each path.

After backward() completes, every Value in the graph has a .grad containing \(\frac{\partial L}{\partial v}\), which tells us how the final loss would change if we nudged that value. Here’s a concrete example. Note that a is used twice (the graph branches), so its gradient is the sum of both paths:

    a = Value(2.0)
    b = Value(3.0)
    c = a * b # c = 6.0
    L = c + a # L = 8.0
    L.backward()
    print(a.grad) # 4.0 (dL/da = b + 1 = 3 + 1, via both paths)
    print(b.grad) # 2.0 (dL/db = a = 2)

This is exactly what PyTorch’s .backward() gives you:

    import tor