What is a compiler
A compiler is a program that transforms source code written by humans into machine code that the processor can execute. Although it sounds simple, this process involves several well-defined stages organized in a pipeline:
- Frontend (parsing): reads the source code, verifies the syntax, and builds an internal structure called an AST (Abstract Syntax Tree).
- Intermediate representation (IR): the AST is converted into an intermediate form, independent of both the source language and the target architecture.
- Optimizations: the IR goes through various transformations that improve performance without altering the program’s behavior.
- Backend (code generation): the optimized IR is translated into native instructions for the target architecture (x86, ARM, RISC-V, etc.).
- Binary: the final result is an executable or library that runs directly on the hardware.
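To make the stages concrete, here is a deliberately tiny sketch of the same pipeline for arithmetic expressions: a lexer, a parser that builds an AST, a pass that flattens the AST into a linear "IR", and a stack machine that executes it. Everything here is invented for illustration; a real compiler is far more elaborate.

```python
# Toy pipeline: source -> tokens -> AST -> linear "IR" -> result.
import re

def lex(src):
    # Frontend, step 1: character stream -> tokens.
    return re.findall(r"\d+|[+*()]", src)

def parse(tokens):
    # Frontend, step 2: tokens -> AST (nested tuples), with * binding tighter than +.
    def expr(i):
        node, i = term(i)
        while i < len(tokens) and tokens[i] == "+":
            rhs, i = term(i + 1)
            node = ("+", node, rhs)
        return node, i
    def term(i):
        node, i = atom(i)
        while i < len(tokens) and tokens[i] == "*":
            rhs, i = atom(i + 1)
            node = ("*", node, rhs)
        return node, i
    def atom(i):
        if tokens[i] == "(":
            node, i = expr(i + 1)
            return node, i + 1  # skip ")"
        return ("num", int(tokens[i])), i + 1
    return expr(0)[0]

def to_ir(ast, out):
    # Middle/backend: flatten the tree into linear instructions.
    if ast[0] == "num":
        out.append(("const", ast[1]))
    else:
        to_ir(ast[1], out)
        to_ir(ast[2], out)
        out.append((ast[0],))
    return out

def run(ir):
    # "Hardware": execute the linear IR on a tiny stack machine.
    stack = []
    for ins in ir:
        if ins[0] == "const":
            stack.append(ins[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if ins[0] == "+" else a * b)
    return stack[0]

print(run(to_ir(parse(lex("2+3*4")), [])))  # -> 14
```

Note how each stage only talks to its neighbors through a well-defined data structure (tokens, AST, IR); that is the property that lets real compilers swap frontends and backends independently.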
This separation into stages allows different languages to share the same backend and the same optimizations, as long as they generate the same IR. This modular principle is exactly what LLVM leverages.
What is LLVM
LLVM (originally Low Level Virtual Machine, though the name is now treated as a standalone brand) is a collection of modules and tools for building compilers. More than a specific compiler, it is a compiler infrastructure designed to be reusable and extensible.
The project was born in 2000 at the University of Illinois at Urbana-Champaign, growing out of Chris Lattner’s master’s thesis under the supervision of Vikram Adve. From the start, LLVM was based on the concept of SSA (Static Single Assignment), a way of representing programs where each variable is assigned exactly once, which drastically simplifies code analysis and optimization.
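To see what SSA means in practice, here is a toy sketch (invented for this article) that renames each assignment in a straight-line program so every variable is defined exactly once, the way an SSA construction pass would for code without branches (branches additionally require phi nodes, which this sketch deliberately omits):

```python
# Toy SSA renaming for straight-line code (no branches, so no phi nodes).
# Each assignment to x becomes a fresh name x1, x2, ... and every use
# on a right-hand side refers to the latest version. Illustrative only.
import re

def to_ssa(lines):
    version = {}  # variable name -> current version number
    out = []
    for line in lines:
        target, expr = [s.strip() for s in line.split("=")]
        # Rewrite uses on the right-hand side to the latest versions.
        expr = re.sub(r"[a-z]+",
                      lambda m: f"{m.group()}{version[m.group()]}",
                      expr)
        # The definition gets a brand-new version: assigned exactly once.
        version[target] = version.get(target, 0) + 1
        out.append(f"{target}{version[target]} = {expr}")
    return out

for line in to_ssa(["x = 1", "x = x + 2", "y = x * 3"]):
    print(line)
# x1 = 1
# x2 = x1 + 2
# y1 = x2 * 3
```

Once every value has exactly one definition, questions like "where does this value come from?" have a single answer, which is why so many optimizations become simple over SSA.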
LLVM components
The LLVM ecosystem goes far beyond a single compiler. Its main components include:
- LLVM Core: the fundamental libraries that define the IR, the type system, the optimizer, and the code generation backends for various architectures.
- Clang: the C, C++, and Objective-C compiler built on top of LLVM. Known for clear error messages and competitive compilation times.
- LLDB: a next-generation debugger that replaces GDB in many workflows, especially in the Apple ecosystem.
- MLIR (Multi-Level Intermediate Representation): a framework for defining and optimizing intermediate representations at different levels of abstraction, widely used in machine learning compilers.
- Polly: a polyhedral optimizer focused on advanced loop transformations and automatic parallelism.
- OpenMP: implementation of parallelization support via OpenMP directives within the LLVM/Clang ecosystem.
This modularity is what makes LLVM so popular: you can use only the parts you need to build your own compiler or analysis tool.
LLVM IR: the heart of the project
The intermediate representation (IR) is the central component that connects frontends to backends. Think of it as a universal assembly: human-readable, independent of the target architecture, and rich enough to enable sophisticated optimizations.
A simple example of LLVM IR for a function that adds two integers:
define i32 @sum(i32 %a, i32 %b) {
entry:
  %result = add i32 %a, %b
  ret i32 %result
}
A few things to note:
- i32 indicates a 32-bit integer.
- %a, %b, and %result are virtual registers in SSA form (each one is assigned a value exactly once).
- The function is declared explicitly, with types in every position.
The IR can exist in three forms: readable text (.ll), binary bitcode (.bc), and in-memory representation during compilation. This flexibility allows different tools to consume and produce IR at any stage of the pipeline.
Kaleidoscope: learning compilers in practice
Kaleidoscope is a toy language created for educational purposes in the official LLVM tutorial. It is deliberately simple, elegant, and visually clear, with only one data type: 64-bit floating-point numbers (double).
Here is an example that computes the nth Fibonacci number:
# Computes the nth Fibonacci number
def fib(x)
  if x < 3 then
    1
  else
    fib(x-1) + fib(x-2)

fib(40)
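For reference, the Kaleidoscope function above behaves like this direct Python translation; note that in Kaleidoscope if/then/else is an expression, so each branch produces the function's return value:

```python
def fib(x):
    # Mirrors the Kaleidoscope definition:
    # if x < 3 then 1 else fib(x-1) + fib(x-2)
    return 1 if x < 3 else fib(x - 1) + fib(x - 2)

print(fib(10))  # -> 55
```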
Despite its simplicity, implementing a compiler for Kaleidoscope teaches the complete fundamentals of compiler construction:
- Lexing: transforming a sequence of characters into tokens (keywords, identifiers, operators).
- Parsing: organizing tokens into an abstract syntax tree (AST) that represents the program’s structure.
- AST: the central data structure that connects the frontend to the rest of the pipeline.
- Code generation with LLVM: traversing the AST and emitting calls to the LLVM API to generate the corresponding IR.
- JIT compilation: compiling and executing code in real time using LLVM’s JIT engine, without needing to generate an on-disk executable.
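As a sketch of the first of these steps, a minimal lexer for Kaleidoscope-style input might look like the following. The token names and the exact regular expression are invented for this example; the official tutorial implements the same idea in C++ with a hand-written character loop.

```python
import re

# Toy lexer: classify each lexeme as a keyword, number, identifier,
# or single-character operator. Comments (# ...) are skipped.
KEYWORDS = {"def", "extern", "if", "then", "else"}
TOKEN_RE = re.compile(r"#[^\n]*|[A-Za-z_]\w*|\d+\.?\d*|[+\-*<>(),]")

def lex(src):
    tokens = []
    for lexeme in TOKEN_RE.findall(src):
        if lexeme.startswith("#"):
            continue  # comment: discarded by the lexer
        elif lexeme in KEYWORDS:
            tokens.append(("keyword", lexeme))
        elif lexeme[0].isdigit():
            # Kaleidoscope's only type: 64-bit floating point.
            tokens.append(("number", float(lexeme)))
        else:
            kind = "identifier" if (lexeme[0].isalpha() or lexeme[0] == "_") else "operator"
            tokens.append((kind, lexeme))
    return tokens

print(lex("def fib(x)"))
# [('keyword', 'def'), ('identifier', 'fib'), ('operator', '('),
#  ('identifier', 'x'), ('operator', ')')]
```

The parser then consumes this flat token stream and rebuilds the nesting (function bodies, if/then/else, operator precedence) as an AST.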
The tutorial guides the reader step by step, from the most basic lexer to adding control structures, mutable variables, and optimizations. By the end, you have a functional compiler with JIT that runs in just a few hundred lines of C++.
Why this matters
Understanding compilers is not just an academic curiosity. LLVM is present in projects that millions of developers use daily:
- Rust uses LLVM as the backend for rustc, which allows the Rust compiler to generate highly optimized native code for dozens of architectures.
- Swift was designed from the ground up around LLVM, also by Chris Lattner.
- Chromium is built with Clang, and its V8 JavaScript engine uses JIT compilation techniques that share concepts with LLVM.
- Several languages such as Julia, Kotlin Native, and Zig also use LLVM in their compilation pipelines.
Understanding how compilers work helps you write better code: you understand why certain constructs are more efficient, how the compiler optimizes loops, why inlining matters, and what happens between the return you write and the ret instruction the processor executes.
Furthermore, LLVM’s modularity has opened the door to innovation in areas such as GPU compilers, DSLs (Domain-Specific Languages), and even machine learning model compilation with tools like MLIR.