The Magical Art of Writing Compilers

Project Github

Why Write a Compiler?

To me, programming languages have always felt arcane — like magical runes. You write strange symbols, and your computer suddenly obeys. With the rise of new languages like Rust, Zig, and Odin, I wanted to understand the magic myself: what would it take to build my own language?

Setting Goals

My goal was to keep things simple — close to C — but with modern improvements to support common idioms like slices, native strings, option/result types, and tagged unions.

Which Came First — the Compiler or the Language?

If a compiler is a program that translates high-level code into machine instructions, how do you write the first compiler when you don’t yet have the language?

The answer is straightforward: write the first version of your compiler in another language. Then, use that compiler to recompile itself — this process is called self-hosting. This is exactly how Clang, one of the most popular C++ compilers, was built.

In this article, I’ll focus on bootstrapping that very first version.

Lexing

The first step of compilation is to strip away everything that doesn’t affect the grammar — whitespace, newlines, and comments — and convert the code into a sequence of tokens. This process is called lexical analysis.

The lexer scans the source code for recognizable patterns and produces tokens for the parser to analyze syntactically.

Lexers as Iterators

A simple and elegant approach is the lookahead lexer (LL1), which works like an iterator:

		typedef struct {
			int start;
			int end;
			string source;
		} lexer;

		token lexer_peek(lexer* lex); // look ahead without consuming
		token lexer_next(lexer* lex); // consume and advance
		

Lexers are lazy — they only compute when you call peek or next.

Parsing

Extended Backus–Naur Form (EBNF)

EBNF is a meta-language used to describe programming language grammars. It’s an extension of the classic Backus–Naur Form.

Here’s an example grammar for expressions:

		expr = number | unaryop expr | expr binop expr
		unaryop = "-" | "&" | "*"
		binop = "+" | "-" | "*" | "/"
		

EBNF helps you reason about and refine your grammar to eliminate ambiguities.

Tagged Unions

To represent expression nodes like the ones above, we need a structure that can hold multiple data types. A tagged union does exactly that. I use them throughout the compiler:

		typedef struct expr expr;

		typedef enum {
			expr_type_term,
			expr_type_unary,
			expr_type_binary
		} expr_type;

		typedef struct expr {
			expr_type type;
			union {
				int term;
				struct {
					operation op;
					expr* rhs;
				} unary;
				struct {
					operation op;
					expr* lhs;
					expr* rhs;
				} binary;
			};
		} expr;
		

Note that recursive definitions require pointers because structs must have a fixed size.

Recursive Descent Parsing

A recursive descent parser is one of the simplest and most natural ways to implement parsing. To parse an expression, you check for a number first, then unary operators, and finally binary expressions:

		expr = number | unaryop expr | expr binop expr
		

This leads to elegant, human-readable parser code:

		expr parse_expr(lexer* lex) {
			if (lexer_peek(lex).type == token_number) {
				return parse_expr_number(lex);
			}

			if (is_unary_op(lexer_peek(lex))) {
				return parse_expr_unary(lex);
			}

			expr lhs = parse_expr(lex);
			operator op = parse_operator(lex);
			expr rhs = parse_expr(lex);

			return expr_binary(op, lhs, rhs);
		}
		

Left Recursion

One limitation of recursive descent is that grammars with left recursion can’t be parsed directly — they lead to infinite recursion:

expr = expr binop expr | number

Intermediate Representation (IR)

After parsing, most compilers convert the syntax tree into an Intermediate Representation (IR), which is later lowered into machine code.

This approach allows reuse and optimization — IRs can even be passed to backends like LLVM for code generation.

My language’s IR uses tagged unions once again:

		typedef struct {
			instruction_type type;
			union {
				instruction_assign assign;
				instruction_func func;
				instruction_scope scope;
				instruction_return ret;
				instruction_intrinsic intrinsic;
			} as;
		} instruction;
		

Symbol Table

A symbol table tracks where variables and functions live in memory and ensures they’re used correctly. I initially thought a hashmap would suffice, but scopes made it more complex than expected.

Properties of a Symbol Table

Doubly Linked List

I implemented my symbol table as a doubly linked list:

		typedef struct symbol symbol;

		struct symbol {
			symbol_type type;
			string identifier;
			size_t offset;
			size_t size;

			symbol* next;
			symbol* prev;
		};

		typedef struct {
			symbol* first;
			symbol* last;
		} symbol_list;
		

This makes it easy to “rollback” to a previous scope by restoring the first and last pointers.

Here’s an example of symbol table usage during scope compilation:

		void compile_scope(scope sc, symbol_list* symbols) {
			symbol_list saved = *symbols;
			size_t offset = 0;

			for (int i = 0; i < sc.statements.len; i++) {
				statement stm = sc.statements.ptr[i];
				switch (stm.type) {
				case statement_type_scope:
					compile_scope(stm.as.scope, symbols);
					break;

				case statement_type_decl: {
					size_t size = size_of_type(stm.as.declaration.type);
					symbol_add(symbols, symbol_type_local,
						stm.as.declaration.identifier, offset, size);
					offset += size;
				} break;

				case statement_type_assign: {
					symbol* sym = symbol_lookup(symbols, stm.as.assignment.identifier);
					if (!sym) continue;

					instruction ins = {0};
					ins.type = INS_ASSIGN;
					ins.as.assign.value = compile_expression(stm.as.assignment.value, symbols);

					array_append(&instructions, ins);
				} break;
				}
			}

			*symbols = saved;
		}
		

symbol_lookup walks backward through the list to find the most recent symbol definition.

Epilogue

This is just the beginning — I still have a long way to go, especially in the code generation and assembly stages. I’ll definitely cover that in part 2.