Hacker News · Feb 17, 2026 · Collected from RSS
Andrej Karpathy showed us the GPT algorithm. I wanted to see the hardware limit. The punchline: I made it go 4,600x faster in pure C code, with no dependencies, using a compiler with SIMD auto-vectorisation!

Andrej recently released microgpt.py - a brilliant, atomic look at the core of a GPT. As a low-latency developer, I couldn't resist seeing how fast it could go when you get closer to the metal. So, just for funzies, I spent a few hours building microgpt-c, a zero-dependency, pure C99 implementation featuring:

- 4,600x faster training vs the Python reference (tested on a MacBook Pro M2 Max); on Windows it is 2,300x faster.
- SIMD auto-vectorisation for high-speed matrix operations.
- INT8 quantisation (reducing weight storage by ~8x). Training is slightly slower, but the storage reduction is significant.
- Zero dependencies - just pure logic.

The amalgamation image below is just for fun (and to show off the density!), but the GitHub repo contains the fully commented, structured code for anyone who wants to play with on-device AI. I have started to build something useful, like a simple C code static analyser - I will do a follow-up post. Everything else is just efficiency... but efficiency is where the magic happens.

Comments URL: https://news.ycombinator.com/item?id=47042014
Points: 31 # Comments: 1
# MicroGPT-C

A zero-dependency, pure C99 implementation of a GPT-style character-level language model. The algorithm faithfully matches Andrej Karpathy's microgpt.py — same architecture, same training loop, same sampling — but compiles to native code with optional compiler-driven SIMD auto-vectorisation for dramatically faster training and inference.

Train a GPT in 20 ms. Generate names in microseconds. No Python. No PyTorch. No GPU.

## What Is This?

MicroGPT-C is a minimal, readable implementation of a GPT (Generative Pre-trained Transformer) — the same family of models behind ChatGPT, but stripped down to its essential algorithm. It trains a tiny character-level language model that learns to generate realistic human names from scratch. The goal is education and experimentation: understand how attention, backpropagation, and the Adam optimiser actually work at the lowest level, without any framework abstractions.

| Audience | Value |
|---|---|
| Students & educators | Study attention, softmax, Adam, and backprop in readable C — no framework magic |
| Embedded / edge engineers | Entire model fits in < 50 KB RAM; runs on MCUs with no runtime dependencies |
| Researchers | Auditable baseline for quantisation, custom layers, or optimiser experiments |
| Rapid prototypers | Train → iterate in milliseconds; test tokenisers, vocabularies, data formats |

## Quick Start

```
# Linux / macOS
chmod +x build.sh
./build.sh
./build/microgpt
```

```
:: Windows
build.bat
build\Release\microgpt.exe
```

The build automatically copies data/names.txt next to the executable.

## Performance

Measured on the same workload (1,000 training steps, 20 inference samples) — C vs the reference Python:

| Metric | Python | C (fp64) | Speedup |
|---|---|---|---|
| Training time | ~93 s | 0.02 s | ~4,600× |
| Training throughput | ~0.1 k tok/s | ~289 k tok/s | ~2,800× |
| Steps/sec | ~11 | ~40,000 | ~3,600× |
| Inference time | ~0.74 s | < 1 ms | ~700×+ |
| Inference rate | ~27 samples/s | 20,000 samples/s | ~740× |
| Token throughput | — | 109,000 tok/s | — |

INT8 quantised build: ~25% slower training than fp64 on this tiny model, but ~8× smaller weight storage — ideal for constrained devices.

## Architecture

A single-layer, decoder-only Transformer following the GPT-2 design:

```
Input → Token Embed + Pos Embed
      → RMSNorm → Self-Attention (4 heads, causal) → Residual
      → RMSNorm → MLP (fc1 → ReLU → fc2, 4× width) → Residual
      → Linear (lm_head) → Softmax → next-token probabilities
```

| Parameter | Value |
|---|---|
| Embedding dim | 16 |
| Attention heads | 4 |
| Layers | 1 |
| Context length | 16 |
| Total parameters | ~4,600 |
| Weight memory (fp64) | ~37 KB |
| Weight memory (INT8) | ~4.6 KB |
| Training memory | ~144 KB |
| Inference memory | < 50 KB |

Training uses the Adam optimiser with linear learning-rate decay (configurable in microgpt.h).

## Build Options

### Build scripts (recommended)

| Platform | Standard | SIMD (faster) |
|---|---|---|
| Linux/macOS | ./build.sh | ./build.sh --simd |
| Windows | build.bat | build.bat simd |

### SIMD auto-vectorisation

The --simd flag enables compiler-driven auto-vectorisation of the core dot products, matrix multiplications, and normalisations. On x86-64 the compiler targets the best available instruction set (SSE4, AVX2, etc.) via -march=native; on MSVC it enables /arch:AVX2. This gives a measurable speed-up on larger models without any hand-written intrinsics — the compiler rewrites the scalar loops into SIMD instructions automatically (a small illustration follows the build commands below).

```
# Linux / macOS — auto-detect best ISA
./build.sh --simd

# CMake directly
cmake -DMICROGPT_SIMD=ON ..
cmake --build . --config Release
```
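To make the idea concrete, here is a toy sketch of the kind of plain C99 loop an auto-vectoriser acts on. It is illustrative only, not a kernel taken from microgpt.c: the function name and sizes are made up, and whether a given loop is actually vectorised depends on the compiler and flags (element-wise updates like this one vectorise readily, while dot-product reductions may additionally need flags such as -ffast-math).

```c
/* axpy_demo.c - illustrative only; not code from microgpt.c.
 * Each element of y is independent, so an optimising compiler
 * (gcc/clang with -O3 -march=native, or MSVC with /O2 /arch:AVX2)
 * can typically rewrite this scalar loop as packed SIMD
 * multiply-adds without any intrinsics in the source. */
#include <stdio.h>

static void axpy(double *y, const double *x, double s, int n) {
    for (int i = 0; i < n; i++)
        y[i] += s * x[i];          /* candidate for auto-vectorisation */
}

int main(void) {
    enum { N = 16 };               /* same order of size as the embedding dim */
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 0.1 * i; y[i] = 1.0; }
    axpy(y, x, 2.0, N);
    printf("y[0]=%.2f y[%d]=%.2f\n", y[0], N - 1, y[N - 1]);
    return 0;
}
```

Compilers can report which loops they vectorised (for example GCC's -fopt-info-vec or Clang's -Rpass=loop-vectorize), which is a handy way to confirm that a --simd build is doing what you expect.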
### INT8 quantised build

Weights are stored as 8-bit integers with per-matrix scales — the forward pass dequantises on the fly; Adam updates an fp64 master copy and requantises each step. This reduces weight storage by ~8× (37 KB → 4.6 KB) at a small accuracy/speed trade-off. (A minimal illustrative sketch of this scheme appears at the end of this README.)

| Platform | Standard | SIMD |
|---|---|---|
| Linux/macOS | ./build_quantised.sh | ./build_quantised.sh --simd |
| Windows | build_quantised.bat | build_quantised.bat simd |

### CMake directly

```
mkdir build && cd build
cmake ..
cmake --build . --config Release

# With INT8 quantisation
cmake -DQUANTIZATION_INT8=ON ..

# With SIMD auto-vectorisation
cmake -DMICROGPT_SIMD=ON ..

# Both
cmake -DQUANTIZATION_INT8=ON -DMICROGPT_SIMD=ON ..
```

## Project Layout

| Path | Description |
|---|---|
| microgpt.h | Model config, public API declarations |
| microgpt.c | Core engine: model, forward/backward, Adam, data loading |
| main.c | Entry point: load data → train → generate samples |
| microgpt_amalgamated.c | Single-file build — same algorithm, no header needed |
| data/names.txt | Training data (one name per line, ~32k names) |
| CMakeLists.txt | CMake build (C99, Release, optional SIMD / INT8) |

## Single-File Build

microgpt_amalgamated.c is a self-contained single file containing the full GPT algorithm — data loading, training, and inference. No header file needed:

```
# Compile directly (no CMake required)
cc -O2 -o microgpt microgpt_amalgamated.c -lm
cp data/names.txt . && ./microgpt

# Or via CMake
cmake --build build --config Release --target microgpt_amalgamated
./build/microgpt_amalgamated
```

## Requirements

- C99 compiler (GCC, Clang, MSVC)
- CMake 3.10+
- No other dependencies

## License

MIT — see LICENSE and source file headers.

Author: Ajay Soni (ajay.soni@enjector.com), Enjector Software Ltd.
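As a closing illustration of the per-matrix INT8 scheme described under Build Options, here is a minimal, self-contained sketch. The QMatrix struct and the quantise/deq names are hypothetical and do not come from microgpt.c; they only mirror the idea stated above of keeping an fp64 master copy for Adam, requantising after each step, and dequantising on the fly in the forward pass.

```c
/* int8_sketch.c - a hypothetical illustration of per-matrix INT8
 * quantisation; names and layout are NOT taken from microgpt.c. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int8_t *q;      /* quantised weights, one byte each              */
    double  scale;  /* per-matrix scale: real value ~ q[i] * scale   */
    int     n;      /* number of weights in this matrix              */
} QMatrix;

/* Requantise from the fp64 master copy (e.g. after an Adam step). */
static void quantise(QMatrix *m, const double *master) {
    double max_abs = 1e-12;                 /* avoid divide-by-zero */
    for (int i = 0; i < m->n; i++) {
        double a = fabs(master[i]);
        if (a > max_abs) max_abs = a;
    }
    m->scale = max_abs / 127.0;
    for (int i = 0; i < m->n; i++) {
        double q = round(master[i] / m->scale);
        if (q > 127.0)  q = 127.0;          /* clamp to int8 range */
        if (q < -127.0) q = -127.0;
        m->q[i] = (int8_t)q;
    }
}

/* Dequantise a single weight on the fly during the forward pass. */
static double deq(const QMatrix *m, int i) { return m->q[i] * m->scale; }

int main(void) {
    double master[4] = { 0.31, -0.07, 1.20, -0.55 };
    int8_t storage[4];
    QMatrix m = { storage, 0.0, 4 };
    quantise(&m, master);
    for (int i = 0; i < 4; i++)
        printf("w=%.4f  q=%4d  deq=%.4f\n", master[i], m.q[i], deq(&m, i));
    return 0;
}
```

Storing one byte per weight plus a single double per matrix is what gives the roughly 8× reduction over fp64 quoted above (37 KB → 4.6 KB), at the cost of the small rounding error visible in the deq column.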