The Magical Art of Writing Compilers
Why Write a Compiler?
To me, programming languages have always felt arcane — like magical runes. You write strange symbols, and your computer suddenly obeys. With the rise of new languages like Rust, Zig, and Odin, I wanted to understand the magic myself: what would it take to build my own language?
Setting Goals
My goal was to keep things simple — close to C — but with modern improvements to support common idioms like slices, native strings, option/result types, and tagged unions.
Which Came First — the Compiler or the Language?
If a compiler is a program that translates high-level code into machine instructions, how do you write the first compiler when you don’t yet have the language?
The answer is straightforward: write the first version of your compiler in another language. Then, use that compiler to recompile itself — this process is called self-hosting. This is exactly how Clang, one of the most popular C++ compilers, was built.
In this article, I’ll focus on bootstrapping that very first version.
Lexing
The first step of compilation is to strip away everything that doesn’t affect the grammar — whitespace, newlines, and comments — and convert the code into a sequence of tokens. This process is called lexical analysis.
The lexer scans the source code for recognizable patterns and produces tokens for the parser to analyze syntactically.
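What does a token look like? Typically just a tag plus the span of source text it covers. Here's a minimal sketch (these token_type values are illustrative, not my compiler's full set):

typedef enum {
    token_number,
    token_identifier,
    token_plus,
    token_minus,
    token_eof
} token_type;

typedef struct {
    token_type type;
    int start; // byte offset of the first character of the lexeme
    int end;   // byte offset one past the last character
} token;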
Lexers as Iterators
A simple and elegant approach is a lexer with one token of lookahead, which works like an iterator:
typedef struct {
    int start;
    int end;
    string source;
} lexer;
token lexer_peek(lexer* lex); // look ahead without consuming
token lexer_next(lexer* lex); // consume and advance
Lexers are lazy — they only compute when you call peek or next.
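Here's a sketch of how these could be implemented, assuming helper routines skip_whitespace and scan_token (names of my choosing) that recognize one token starting at lex->start:

token lexer_next(lexer* lex) {
    skip_whitespace(lex);        // also skips newlines and comments
    token tok = scan_token(lex); // match one pattern starting at lex->start
    lex->start = tok.end;        // consume: the next call begins after it
    return tok;
}

token lexer_peek(lexer* lex) {
    lexer saved = *lex;          // copy the cursor state...
    token tok = lexer_next(lex);
    *lex = saved;                // ...and restore it, so nothing is consumed
    return tok;
}

Peeking is just advancing on a throwaway copy of the lexer state, which is what makes the iterator interface so cheap.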
Parsing
Extended Backus–Naur Form (EBNF)
EBNF is a meta-language used to describe programming language grammars. It’s an extension of the classic Backus–Naur Form.
Here’s an example grammar for expressions:
expr    = number | unaryop expr | expr binop expr
unaryop = "-" | "&" | "*"
binop   = "+" | "-" | "*" | "/"
EBNF helps you reason about and refine your grammar to eliminate ambiguities.
Tagged Unions
To represent expression nodes like the ones above, we need a structure that can hold multiple data types. A tagged union does exactly that. I use them throughout the compiler:
typedef struct expr expr;

typedef enum {
    expr_type_term,
    expr_type_unary,
    expr_type_binary
} expr_type;

struct expr {
    expr_type type;
    union {             // anonymous union (C11): only one variant is active at a time
        int term;
        struct {
            operation op;
            expr* rhs;
        } unary;
        struct {
            operation op;
            expr* lhs;
            expr* rhs;
        } binary;
    };
};
Note that recursive definitions require pointers because structs must have a fixed size.
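As a quick usage example, here's how the tree for the expression -5 could be built by hand. expr_new is a hypothetical helper that moves a node to the heap, and op_neg is a stand-in for whatever the operation enum calls negation:

#include <stdlib.h> // for malloc

expr* expr_new(expr e) {
    expr* p = malloc(sizeof(expr));
    *p = e;
    return p;
}

expr make_neg_five(void) {
    expr term = {0};
    term.type = expr_type_term;
    term.term = 5;               // anonymous union members are accessed directly
    expr neg = {0};
    neg.type = expr_type_unary;
    neg.unary.op  = op_neg;      // op_neg: stand-in for the real operation value
    neg.unary.rhs = expr_new(term);
    return neg;
}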
Recursive Descent Parsing
A recursive descent parser is one of the simplest and most natural ways to implement parsing. To parse an expression, you first parse a term (a number or a unary expression), and only then check whether a binary operator follows:
expr = number | unaryop expr | expr binop expr
This leads to elegant, human-readable parser code:
expr parse_expr(lexer* lex) {
    // Parse a term first: a number or a unary expression.
    expr lhs;
    if (lexer_peek(lex).type == token_number) {
        lhs = parse_expr_number(lex);
    } else {
        lhs = parse_expr_unary(lex); // error handling elided
    }
    // Only then check for a binary operator, with the term as its left-hand side.
    if (is_binary_op(lexer_peek(lex))) { // helper analogous to is_unary_op
        operation op = parse_operator(lex);
        expr rhs = parse_expr(lex);
        return expr_binary(op, lhs, rhs);
    }
    return lhs;
}
Left Recursion
One limitation of recursive descent is that grammars with left recursion can't be parsed directly. In a rule like the one below, the parser would call itself before consuming a single token and recurse forever:
expr = expr binop expr | number
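The fix, already baked into parse_expr above, is to rewrite the grammar so every alternative consumes a token before recursing:

expr = term [ binop expr ]
term = number | unaryop expr

(Square brackets mean "optional" in EBNF. This naive form also ignores operator precedence and associativity, which deserve their own treatment.)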
Intermediate Representation (IR)
After parsing, most compilers convert the syntax tree into an Intermediate Representation (IR), which is later lowered into machine code.
This approach allows reuse and optimization — IRs can even be passed to backends like LLVM for code generation.
My language’s IR uses tagged unions once again:
typedef struct {
    instruction_type type;
    union {
        instruction_assign assign;
        instruction_func func;
        instruction_scope scope;
        instruction_return ret;
        instruction_intrinsic intrinsic;
    } as;
} instruction;
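To give a feel for the shape of it: a statement like x = a + b * c lowers into a flat sequence of instructions, each performing at most one computation (shown here as illustrative pseudo-IR, not my compiler's exact output):

t0 = b * c
t1 = a + t0
x  = t1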
Symbol Table
A symbol table tracks where variables and functions live in memory and ensures they’re used correctly. I initially thought a hashmap would suffice, but scopes made it more complex than expected.
Properties of a Symbol Table
- It must allow multiple symbols with the same name (e.g., two foo variables in nested scopes).
- Symbols should be added at scope entry and removed on exit.
Doubly Linked List
I implemented my symbol table as a doubly linked list:
typedef struct symbol symbol;
struct symbol {
    symbol_type type;
    string identifier;
    size_t offset;
    size_t size;
    symbol* next;
    symbol* prev;
};

typedef struct {
    symbol* first;
    symbol* last;
} symbol_list;
This makes it easy to "roll back" to a previous scope by restoring the first and last pointers.
Here’s an example of symbol table usage during scope compilation:
void compile_scope(scope sc, symbol_list* symbols) {
    symbol_list saved = *symbols; // remember the list bounds for rollback
    size_t offset = 0;
    for (int i = 0; i < sc.statements.len; i++) {
        statement stm = sc.statements.ptr[i];
        switch (stm.type) {
        case statement_type_scope:
            compile_scope(stm.as.scope, symbols);
            break;
        case statement_type_decl: {
            // Reserve space for the variable and register it as a symbol.
            size_t size = size_of_type(stm.as.declaration.type);
            symbol_add(symbols, symbol_type_local,
                       stm.as.declaration.identifier, offset, size);
            offset += size;
        } break;
        case statement_type_assign: {
            symbol* sym = symbol_lookup(symbols, stm.as.assignment.identifier);
            if (!sym) continue; // undeclared identifier; a real pass would report an error
            instruction ins = {0};
            ins.type = instruction_type_assign;
            ins.as.assign.value = compile_expression(stm.as.assignment.value, symbols);
            array_append(&instructions, ins);
        } break;
        }
    }
    *symbols = saved; // roll back: drop every symbol this scope declared
}
symbol_lookup walks backward through the list to find the most recent symbol definition.
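For completeness, here's a sketch of the two list operations, assuming a string_eq helper for comparing the string type (the exact signatures in my compiler may differ):

#include <stdlib.h> // for malloc

symbol* symbol_lookup(symbol_list* list, string identifier) {
    // Walk backward so the innermost (most recent) definition shadows outer ones.
    for (symbol* sym = list->last; sym != NULL; sym = sym->prev) {
        if (string_eq(sym->identifier, identifier)) {
            return sym;
        }
    }
    return NULL;
}

void symbol_add(symbol_list* list, symbol_type type,
                string identifier, size_t offset, size_t size) {
    symbol* sym = malloc(sizeof(symbol));
    *sym = (symbol){ .type = type, .identifier = identifier,
                     .offset = offset, .size = size,
                     .prev = list->last, .next = NULL };
    if (list->last) list->last->next = sym;
    else list->first = sym;
    list->last = sym;
}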
Epilogue
This is just the beginning — I still have a long way to go, especially in the code generation and assembly stages. I’ll definitely cover that in part 2.