The Magical Art of Writing Compilers
Why Write a Compiler?
To me, programming languages have always felt arcane — like magical runes. You write strange symbols, and your computer suddenly obeys. With the rise of new languages like Rust, Zig, and Odin, I wanted to understand the magic myself: what would it take to build my own language?
Setting Goals
My goal was to keep things simple — close to C — but with modern improvements to support common idioms like slices, native strings, option/result types, and tagged unions.
Which Came First — the Compiler or the Language?
If a compiler is a program that translates high-level code into machine instructions, how do you write the first compiler when you don’t yet have the language?
The answer is straightforward: write the first version of your compiler in another language. Then, use that compiler to recompile itself — this process is called self-hosting. This is exactly how Clang, one of the most popular C++ compilers, was built.
In this article, I’ll focus on bootstrapping that very first version.
Lexing
The first step of compilation is to strip away everything that doesn’t affect the grammar — whitespace, newlines, and comments — and convert the code into a sequence of tokens. This process is called lexical analysis.
The lexer scans the source code for recognizable patterns and produces tokens for the parser to analyze syntactically.
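What does a token look like? Typically just a tag plus the span of source text it covers. Here's a minimal sketch (these token_type values are illustrative, not my compiler's full set):

typedef enum {
    token_number,
    token_identifier,
    token_plus,
    token_minus,
    token_eof
} token_type;

typedef struct {
    token_type type;
    int start; // byte offset of the first character of the lexeme
    int end;   // byte offset one past the last character
} token;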
Lexers as Iterators
A simple and elegant approach is a lexer with one token of lookahead, which works like an iterator:
typedef struct {
    int start;
    int end;
    string source;
} lexer;
token lexer_peek(lexer* lex); // look ahead without consuming
token lexer_next(lexer* lex); // consume and advance
Lexers are lazy — they only compute when you call peek or next.
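Here's a sketch of how these could be implemented, assuming helper routines skip_whitespace and scan_token (names of my choosing) that recognize one token starting at lex->start:

token lexer_next(lexer* lex) {
    skip_whitespace(lex);        // also skips newlines and comments
    token tok = scan_token(lex); // match one pattern starting at lex->start
    lex->start = tok.end;        // consume: the next call begins after it
    return tok;
}

token lexer_peek(lexer* lex) {
    lexer saved = *lex;          // copy the cursor state...
    token tok = lexer_next(lex);
    *lex = saved;                // ...and restore it, so nothing is consumed
    return tok;
}

Peeking is just advancing on a throwaway copy of the lexer state, which is what makes the iterator interface so cheap.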
Parsing
Extended Backus–Naur Form (EBNF)
EBNF is a meta-language used to describe programming language grammars. It’s an extension of the classic Backus–Naur Form.
Here’s an example grammar for expressions:
expr    = number | unaryop expr | expr binop expr
unaryop = "-" | "&" | "*"
binop   = "+" | "-" | "*" | "/"
EBNF helps you reason about and refine your grammar to eliminate ambiguities.
Tagged Unions
To represent expression nodes like the ones above, we need a structure that can hold multiple data types. A tagged union does exactly that. I use them throughout the compiler:
typedef struct expr expr;

typedef enum {
    expr_type_term,
    expr_type_unary,
    expr_type_binary
} expr_type;

struct expr {
    expr_type type;
    union {             // anonymous union (C11): only one variant is active at a time
        int term;
        struct {
            operation op;
            expr* rhs;
        } unary;
        struct {
            operation op;
            expr* lhs;
            expr* rhs;
        } binary;
    };
};
Note that recursive definitions require pointers because structs must have a fixed size.
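As a quick usage example, here's how the tree for the expression -5 could be built by hand. expr_new is a hypothetical helper that moves a node to the heap, and op_neg is a stand-in for whatever the operation enum calls negation:

#include <stdlib.h> // for malloc

expr* expr_new(expr e) {
    expr* p = malloc(sizeof(expr));
    *p = e;
    return p;
}

expr make_neg_five(void) {
    expr term = {0};
    term.type = expr_type_term;
    term.term = 5;               // anonymous union members are accessed directly
    expr neg = {0};
    neg.type = expr_type_unary;
    neg.unary.op  = op_neg;      // op_neg: stand-in for the real operation value
    neg.unary.rhs = expr_new(term);
    return neg;
}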
Recursive Descent Parsing
A recursive descent parser is one of the simplest and most natural ways to implement parsing. To parse an expression, you first parse a term (a number or a unary expression), and only then check whether a binary operator follows:
expr = number | unaryop expr | expr binop expr
This leads to elegant, human-readable parser code:
expr parse_expr(lexer* lex) {
    // Parse a term first: a number or a unary expression.
    expr lhs;
    if (lexer_peek(lex).type == token_number) {
        lhs = parse_expr_number(lex);
    } else {
        lhs = parse_expr_unary(lex); // error handling elided
    }
    // Only then check for a binary operator, with the term as its left-hand side.
    if (is_binary_op(lexer_peek(lex))) { // helper analogous to is_unary_op
        operation op = parse_operator(lex);
        expr rhs = parse_expr(lex);
        return expr_binary(op, lhs, rhs);
    }
    return lhs;
}
Left Recursion
One limitation of recursive descent is that grammars with left recursion can't be parsed directly. In a rule like the one below, the parser would call itself before consuming a single token and recurse forever:
expr = expr binop expr | number
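The fix, already baked into parse_expr above, is to rewrite the grammar so every alternative consumes a token before recursing:

expr = term [ binop expr ]
term = number | unaryop expr

(Square brackets mean "optional" in EBNF. This naive form also ignores operator precedence and associativity, which deserve their own treatment.)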
Intermediate Representation (IR)
After parsing, most compilers convert the syntax tree into an Intermediate Representation (IR), which is later lowered into machine code.
This approach allows reuse and optimization — IRs can even be passed to backends like LLVM for code generation.
My language’s IR uses tagged unions once again:
typedef struct {
    instruction_type type;
    union {
        instruction_assign assign;
        instruction_func func;
        instruction_scope scope;
        instruction_return ret;
        instruction_intrinsic intrinsic;
    } as;
} instruction;
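To give a feel for the shape of it: a statement like x = a + b * c lowers into a flat sequence of instructions, each performing at most one computation (shown here as illustrative pseudo-IR, not my compiler's exact output):

t0 = b * c
t1 = a + t0
x  = t1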
Symbol Table
A symbol table tracks where variables and functions live in memory and ensures they’re used correctly. I initially thought a hashmap would suffice, but scopes made it more complex than expected.
Properties of a Symbol Table
- It must allow multiple symbols with the same name (e.g., two foo variables in nested scopes).
- Symbols should be added at scope entry and removed on exit.
Doubly Linked List
I implemented my symbol table as a doubly linked list:
typedef struct symbol symbol;
struct symbol {
    symbol_type type;
    string identifier;
    size_t offset;
    size_t size;
    symbol* next;
    symbol* prev;
};

typedef struct {
    symbol* first;
    symbol* last;
} symbol_list;
This makes it easy to "roll back" to a previous scope by restoring the first and last pointers.
Here’s an example of symbol table usage during scope compilation:
void compile_scope(scope sc, symbol_list* symbols) {
    symbol_list saved = *symbols; // remember the list bounds for rollback
    size_t offset = 0;
    for (int i = 0; i < sc.statements.len; i++) {
        statement stm = sc.statements.ptr[i];
        switch (stm.type) {
        case statement_type_scope:
            compile_scope(stm.as.scope, symbols);
            break;
        case statement_type_decl: {
            // Reserve space for the variable and register it as a symbol.
            size_t size = size_of_type(stm.as.declaration.type);
            symbol_add(symbols, symbol_type_local,
                       stm.as.declaration.identifier, offset, size);
            offset += size;
        } break;
        case statement_type_assign: {
            symbol* sym = symbol_lookup(symbols, stm.as.assignment.identifier);
            if (!sym) continue; // undeclared identifier; a real pass would report an error
            instruction ins = {0};
            ins.type = instruction_type_assign;
            ins.as.assign.value = compile_expression(stm.as.assignment.value, symbols);
            array_append(&instructions, ins);
        } break;
        }
    }
    *symbols = saved; // roll back: drop every symbol this scope declared
}
symbol_lookup walks backward through the list to find the most recent symbol definition.
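For completeness, here's a sketch of the two list operations, assuming a string_eq helper for comparing the string type (the exact signatures in my compiler may differ):

#include <stdlib.h> // for malloc

symbol* symbol_lookup(symbol_list* list, string identifier) {
    // Walk backward so the innermost (most recent) definition shadows outer ones.
    for (symbol* sym = list->last; sym != NULL; sym = sym->prev) {
        if (string_eq(sym->identifier, identifier)) {
            return sym;
        }
    }
    return NULL;
}

void symbol_add(symbol_list* list, symbol_type type,
                string identifier, size_t offset, size_t size) {
    symbol* sym = malloc(sizeof(symbol));
    *sym = (symbol){ .type = type, .identifier = identifier,
                     .offset = offset, .size = size,
                     .prev = list->last, .next = NULL };
    if (list->last) list->last->next = sym;
    else list->first = sym;
    list->last = sym;
}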
Epilogue
This is just the beginning — I still have a long way to go, especially in the code generation and assembly stages. I’ll definitely cover that in part 2.