
Hacker News · Mar 1, 2026 · Collected from RSS
Article URL: https://taylor.town/scrapscript-001 Comments URL: https://news.ycombinator.com/item?id=47207531 Points: 8 # Comments: 1
I'm still thinking about those lil' fun langs. How do they work? What's inside them? Do I need my pancreas? What if I don't want to normalize my IR? Is laziness a virtue?

Haskell-esque languages may look alike, but they differ across many dimensions:

- strict vs. lazy
- curried vs. bland
- bootstrapped vs. hosted
- interpreted vs. compiled
- nominal vs. structural types
- pretty vs. ugly errors

Most implementations use standard compilation phases:

1. **Lexing:** Source → Token stream
2. **Parsing:** Tokens → Surface AST
3. **Desugaring:** Surface AST → Core AST
4. **Type Inference:** Core AST → Typed AST
5. **Pattern Match Compile:** Typed AST → Case trees
6. **Normalization (ANF/K):** Typed AST → Normalized IR
7. **Optimization:** Normalized IR → Normalized IR
8. **Closure Conversion:** Normalized IR → Closure-explicit IR
9. **Code Generation:** Closure IR → Target (asm/bytecode/C/LLVM)
10. **Register Allocation:** Virtual regs → Physical regs (if native)
11. **Runtime System:** GC, primitives, entry point

## Strict vs. Lazy

In strict evaluation, arguments are evaluated before being passed to a function. In lazy evaluation, arguments are only evaluated if their value is actually needed; the result is cached, so the work happens at most once.

```haskell
-- lazy eval returns `3` without applying `foo`
length [ 1, foo 2, 4 ]
```

| Aspect | Strict (ML, OCaml) | Lazy (Haskell) |
|---|---|---|
| Normalization | ANF / K-normal form | STG / thunks required |
| Closure conversion | Standard flat closures | Closures + thunks + update frames |
| Code generation | Straightforward | Requires eval/apply or push/enter |
| Memory management | Values are always evaluated | May contain unevaluated thunks |
| Tail calls | Simple (jump) | Complex (enters, updates) |
| Debugging | Easy (call stack is meaningful) | Hard (thunks obscure control flow) |
| Runtime complexity | Simpler (~200 LOC C) | More complex (~500–2000 LOC C) |

Strict evaluation is the simple choice. If you want laziness, Peyton Jones's STG machine is the standard approach. MicroHs sidesteps the STG machine by compiling directly to combinatory logic with graph reduction.
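Call-by-need can be sketched as a memoized thunk. Here is a minimal Python sketch of the idea; the `Thunk` class and `foo` are hypothetical illustrations, not code from any implementation mentioned here:

```python
class Thunk:
    """Call-by-need: delay a computation, run it at most once, cache the result."""
    def __init__(self, compute):
        self.compute = compute   # zero-argument function, run only on demand
        self.forced = False
        self.value = None

    def force(self):
        if not self.forced:
            self.value = self.compute()  # evaluate once...
            self.compute = None          # ...then drop the closure so it can be GC'd
            self.forced = True
        return self.value

# Mimics `length [1, foo 2, 4]`: taking the length never forces the middle element.
def foo(n):
    raise RuntimeError("never evaluated")

xs = [Thunk(lambda: 1), Thunk(lambda: foo(2)), Thunk(lambda: 4)]
print(len(xs))  # → 3, without ever calling foo
```

This is exactly what an "update frame" does in a lazy runtime: the first `force` overwrites the thunk with its value, so later uses pay nothing.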
Lazy evaluation also unlocks infinite collections — you can define an infinite list and consume only what you need.

## Curried vs. Bland

| Style | Examples | Implementation cost |
|---|---|---|
| Curried | Haskell, Ben Lynn, MicroHs | Free in combinator backends; native backends need arity analysis to avoid allocating a closure per argument |
| Bland | MinCaml, OCaml (internally), Grace, EYG | Simpler codegen -- multi-arg functions are just functions that take tuples or multiple params |

In a curried language, `f x y` is `((f) x) y`: two function applications. If your backend doesn't detect that `f` always takes two arguments (arity analysis), you pay for a heap allocation on every multi-argument call.

## Bootstrapped vs. Hosted

> I tried to teach myself to play the guitar. But I'm a horrible teacher — because I do not know how to play a guitar.
>
> — Mitch Hedberg

Most compilers are written in an existing language (e.g. C, Rust, Haskell, OCaml) and lean on that host's ecosystem for parsing libraries, build tools, and package management.

A bootstrapped compiler compiles itself. You write the compiler in the language it compiles, then use an earlier version of the compiler (or a minimal seed runtime) to build the next version. Your language becomes self-sustaining; the compiler is its own test suite.

There are many exemplary self-hosted languages to study:

**MicroHs** is a Haskell compiler that compiles Haskell to combinators. The combinator reducer is implemented in C. The compiler is written in Haskell and can compile itself. Bootstrapping requires only a C compiler -- no pre-existing Haskell installation.

**Ben Lynn** starts with a minimal runtime in C (~350 LOC), then constructs increasingly capable compilers, each written in the subset that the previous one can compile. Each stage is ~100–300 LOC of the language being defined. The total chain is ~2000 LOC + 350 LOC C.
```
C runtime (350 LOC)
  → compiler₁: lambda calculus + integers
  → compiler₂: + let, letrec, ADTs
  → compiler₃: + type inference
  → compiler₄: + pattern matching
  → compiler₅: + type classes
  → ...
  → compilerₙ: near-Haskell-98
```

**Newt** is a dependently typed language whose compiler is written in Newt, targeting JavaScript. It bootstraps by keeping the generated JS checked in. This works best when your target is a high-level runtime (JS, JVM) rather than native code.

## Interpreted vs. Compiled

An interpreter executes the program directly by walking its AST or stepping through bytecode. A compiler translates the program into another language (e.g. x86, C, JS) and lets that target handle execution. The boundary here is blurry: bytecode VMs compile to an intermediate form, and "transpilers" compile to source code rather than machine instructions.

| Strategy | Examples | LOC estimate | Trade-off |
|---|---|---|---|
| Tree-walking interpreter | PLZoo poly, Eff, Frank, Grace, 1ML | 50–200 | Simplest. No codegen, no runtime. Slow (10–100× native) |
| Bytecode VM | OCaml (ZINC), Tao, PLZoo miniml | 200–500 | Middle ground. Portable, reasonable speed. Write ~30–50 instructions |
| Native compilation | MinCaml, mlml, AQaml | 500–1500 | Fast execution, but you own register allocation, calling conventions, ABI |
| Transpile to C | Koka, Scrapscript, Chicken, Austral | 200–500 | Best of both worlds -- portable native speed, C compiler does the hard parts |
| Transpile to JS/Go | Newt, SOSML, Borgo | 200–400 | Web/ecosystem deployment, but you inherit the target's performance model |
| Combinator reduction | Ben Lynn, MicroHs | 100–300 | No closures, no registers. Graph reduction evaluator in C. Simple but slow |

Lil' fun langs are usually interpreters. Without compilation, you can skip closure conversion, register allocation, and runtime systems. The leap from interpreter to compiler costs ~500–2000 LOC.
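To make the "50–200 LOC" claim concrete, a tree-walking interpreter really does fit in a few dozen lines. A minimal Python sketch, with a hypothetical tuple-encoded AST (not taken from PLZoo or any project listed above):

```python
# Minimal tree-walking interpreter for a tiny expression language.
# AST nodes are tuples: ("lit", v), ("var", name), ("lam", param, body),
# ("app", fn, arg), ("let", name, bound, body), ("add", a, b).

def eval_expr(node, env):
    tag = node[0]
    if tag == "lit":
        return node[1]
    if tag == "var":
        return env[node[1]]
    if tag == "lam":
        # capture the defining environment in a Python closure
        return lambda arg: eval_expr(node[2], {**env, node[1]: arg})
    if tag == "app":
        return eval_expr(node[1], env)(eval_expr(node[2], env))
    if tag == "let":
        return eval_expr(node[3], {**env, node[1]: eval_expr(node[2], env)})
    if tag == "add":
        return eval_expr(node[1], env) + eval_expr(node[2], env)
    raise ValueError(f"unknown node: {tag}")

# let inc = fun x -> x + 1 in inc 41
prog = ("let", "inc", ("lam", "x", ("add", ("var", "x"), ("lit", 1))),
        ("app", ("var", "inc"), ("lit", 41)))
print(eval_expr(prog, {}))  # → 42
```

Note what's missing: no closure conversion (the host's closures do the work), no registers, no runtime system. That absence is exactly why interpreters are the cheap option.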
## Nominal vs. Structural Types

```haskell
type Meters = Int
type Seconds = Int
-- Nominal: Meters ≠ Seconds (different names)
-- Structural: Meters = Seconds (same shape)
```

| Style | Examples | Consequence |
|---|---|---|
| Nominal | OCaml, Haskell, Austral | Name is identity -- same shape doesn't mean same type |
| Structural | EYG, Grace, TypeScript, Simple-sub | Shape is identity -- same fields/variants means same type |

Most ML-family languages are nominal for algebraic data types but structural for records (if implemented). Row polymorphism (EYG, Grace, Koka) is inherently structural -- it acts on "any record with at least these fields." Simple-sub goes further: union and intersection types, with principal inference intact.

## Pretty vs. Ugly Errors

```
-- Ugly:
Error: type mismatch: int vs string

-- Pretty:
3 │ let x = 1 + "hello"
  │             ^^^^^^^
Error: I expected an `int` here, but got a `string`.
The left side of `+` is `int`, so the right side must be too.
```

Pretty errors cannot be achieved with a coat of paint. To point at a line/region of code, you must thread source locations through every compiler phase. A minimum viable error system:

1. **Source spans on every AST node.** Every expression, pattern, and type carries `{ file, start_line, start_col, end_line, end_col }`. This costs one extra field per node.
2. **Preserve spans through desugaring.** When you lower `where` to `let`, the new `let` node inherits the span of the `where`.
3. **Preserve spans through type inference.** When unification fails, you need the spans of both conflicting types.
4. **Format errors with context.** Show the source line, underline the relevant span, explain the mismatch.

| Quality | Examples | Cost |
|---|---|---|
| Elm-tier | Elm, Austral | Purpose-built error messages per failure mode. Highest effort, best UX |
| Good enough | Tao, Ante, OCaml | Source spans + generic formatting. Covers 90% of cases |
| Positional | MinCaml, most small compilers | Line numbers but no span highlighting or explanation |
| De Bruijn indices | Elaboration Zoo (intentionally) | Variable names lost -- fine for research, bad for users |

## Lexing

| Approach | Used by | LOC estimate | Notes |
|---|---|---|---|
| Hand-written recursive | MinCaml (Rust port), Tao, Ante | 100–300 | Full control, best errors |
| ocamllex / mllex | MinCaml (original), HaMLet, PLZoo | 50–100 | Standard for OCaml/SML hosts |
| Alex (Haskell) | MicroHs, many Haskell-hosted | 50–100 | Standard for Haskell hosts |
| Parser combinator (integrated) | Ben Lynn, some educational | 0 (part of parser) | Lexerless parsing |

Optional enhancements:

- **Layout/indentation sensitivity** (Haskell-style offside rule): Ben Lynn implements this in later bootstrapping stages. MicroHs includes full layout parsing. Adds 100–300 LOC. The algorithm is well described in Section 2.7 of the Haskell Report.
- **Unicode identifiers:** Most small compilers skip this entirely. Koka supports it.
- **Interpolated strings:** Syntax like `"hello ${name}"` is not standard in the ML family, but some newer languages add it.

## Parsing

Parsing converts the flat token stream into a tree. The surface syntax is parsed into a concrete syntax tree (CST) or directly into an abstract syntax tree (AST). ML-family languages have a well-behaved grammar that is almost LL(1).

| Approach | Used by | LOC estimate | Notes |
|---|---|---|---|
| Recursive descent + Pratt/precedence climbing | MinCaml (Rust port), Tao, Ante | 200–500 | Best error messages, easiest to extend |
| ocamlyacc / mlyacc (LALR) | MinCaml (original), HaMLet | 100–200 (grammar file) | Standard, but poor error recovery |
| Parser combinators (Parsec-style) | Ben Lynn, MicroHs, PLZoo | 100–400 | Elegant, compositional, backtracking |
| PEG / Packrat | Rare in ML-family | 100–300 | Linear time guarantee |

## The AST

Every subsequent phase transforms this type.
In ML-family languages, the AST typically looks like:

```ocaml
type expr =
  | Lit of literal               (* 42, 3.14, "hello", true *)
  | Var of name                  (* x *)
  | App of expr * expr           (* f x *)
  | Lam of name * expr           (* fun x -> e, or \x -> e *)
  | Let of name * expr * expr    (* let x = e1 in e2 *)
  | LetRec of name * expr * expr (* let rec f = e1 in e2 *)
  | If of expr * expr * expr     (* if c then t else f *)
  | Tuple of expr list           (* (a, b, c) *)
  | Match of expr * branch list  (* match e with p1 -> e1 | ... *)
  | Ann of expr * ty             (* (e : t) *)
```

## Name Resolution & Desugaring

Before type inference, the surface AST is simplified:

- **Alpha-renaming:** Every binder is assigned a unique identifier to eliminate shadowing. MinCaml's Rust port does this during type checking; most do this while parsing or during a separate pass.
- **Fixity resolution:** Infix op
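Alpha-renaming is a small recursive traversal. A Python sketch over a hypothetical tuple-encoded AST (mirroring a subset of the `expr` type above; the `$n` suffix convention is just an illustration):

```python
import itertools

# Tuple-encoded AST subset: ("lit", v), ("var", name),
# ("lam", name, body), ("app", f, x), ("let", name, e1, e2)

fresh = itertools.count()

def rename(node, env):
    """Give every binder a globally unique name so later phases can ignore shadowing."""
    tag = node[0]
    if tag == "lit":
        return node
    if tag == "var":
        return ("var", env[node[1]])       # resolve to the binder's unique name
    if tag == "lam":
        uniq = f"{node[1]}${next(fresh)}"  # e.g. "x$0"
        return ("lam", uniq, rename(node[2], {**env, node[1]: uniq}))
    if tag == "app":
        return ("app", rename(node[1], env), rename(node[2], env))
    if tag == "let":
        uniq = f"{node[1]}${next(fresh)}"
        return ("let", uniq, rename(node[2], env),
                rename(node[3], {**env, node[1]: uniq}))
    raise ValueError(f"unknown node: {tag}")

# Shadowing: let x = 1 in let x = 2 in x  -- the two x's get distinct names
prog = ("let", "x", ("lit", 1), ("let", "x", ("lit", 2), ("var", "x")))
print(rename(prog, {}))
```

After this pass, every variable occurrence points unambiguously at one binder, which is what lets later phases treat names as plain keys.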