A Brief And Brisk Overview of Compiler Architecture

Most compilers out there follow a particular architecture: a frontend that reads and parses the source language, a 'middle' that performs semantic analysis on it, and a backend that generates the final code.

Preface #

In this article I intend to dissect this architecture piece by piece in some detail.

Consider this article a supplement to the plethora of resources out there on compilers. It exists as a self contained resource to get your toes wet in the world of programming language design and implementation.

The audience for this article is someone who has very limited knowledge as to how a compiler works, i.e. you know that they compile into assembly at most. Though I do presume that the reader has a good understanding of data structures & algorithms.

It is by no means reflective of modern ‘production’ compilers with millions of lines of code! But rather a very brief/brisk ‘compilers for dummies’ resource to get an idea of what goes on in a compiler.

Introduction #

Currently, I’m working on a programming language called Krug, which is a systems programming language that takes a lot of inspiration from Rust and Go. I’ll be referencing Krug a few times in this article to compare and help illustrate my points. Krug is still under very heavy development, but you can find it on GitHub under the ‘krug-lang’ organisation as caasper, and krug.
The compiler's architecture is a bit unusual compared to that of a typical compiler, which is partly what inspired me to write this article, though this will be discussed further on in the article.

Disclaimer! #

I would like to disclaim that I am by no means an expert in compilers! I don’t have a doctorate in Compilers, I did not study this at an academic level in any way - most of what I am sharing is what I have learned in my spare time for fun. In addition, I am not claiming what I write to be the de-facto approach for engineering a compiler, but rather introducing approaches that would be applicable for a small toy compiler.

Frontend #

Referring back to the architecture outlined above, the inputs to the frontend are programming languages we all know and love today - like C! The frontend looks something like this:

Lexical Analysis -> Parser

Lexical Analysis #

When I was first learning about compilers and language design, this was described to me as a 'fancy way of saying tokenization'. So let's go with that. The 'lexer' typically takes input in the form of a string or a stream of characters, and recognizes patterns in those characters in order to cut them up into tokens.

In the case of a compiler, the input would be the program that the programmer writes. This would be read in from a file into a string, and the lexer would then tokenize the program's source code.

enum TokenType {
    Identifier,
    Number,
};

struct Token {
    std::string lexeme;
    TokenType type;

    // ...
    // It's also handy to store things in here
    // like the position of the token (start to end row:col)
};

In the above snippet (of some kind of C-like language), we have a Token structure which contains the lexeme - the chunk of source text that the token covers - as well as a TokenType to distinguish what kind of lexeme is stored.

Note: This post isn't necessarily a guide to creating a language with code samples, but I will add a few code snippets here and there to help translate these ideas.

Lexers are usually the easiest part of the compiler to make, in fact the entire frontend is usually quite simple relative to the other pieces of the puzzle. Though this depends on how hard you can make it for yourself 😉

Take the following piece of C code:

int main() {
    printf("Hello world!\n");
    return 0;
}

If you were to read this from a file into a string and then linearly scan through the string, you could probably cut it up into tokens. We naturally identify these tokens when looking at the code ourselves, e.g. it's clear that int is a "word", and that 0 in the return statement is a "number".
The lexer does the same kind of thing, and we can go into as much detail as necessary to ease the process later. For example, you could lex:

0xdeadbeef
1231234234
3.1412
55.5555
0b0001

As “numbers”, or you could categorize them further as:

0xdeadbeef      HexNumber
1231234234      WholeNumber
3.1412          FloatingNumber
55.5555         FloatingNumber
0b0001          BinaryNumber

As for defining "words", it can be a little trickier. Most languages define a word as a grouping of letters and digits, where an identifier typically must start with a letter (or an underscore), e.g.

123foobar := 3
person-age := 5
fmt.Println(123foobar)

Is not valid Go code, as it would likely be lexed into these tokens:

Number(123), Identifier(foobar), Symbol(:=), Number(3) ...

Most identifiers that we encounter are of the form:

foo_bar
__uint8_t
fooBar123

Lexers will also have to deal with other problems, e.g. skipping whitespace and comments, handling string literals and escape sequences, distinguishing keywords from identifiers, and keeping track of positions (line and column) for error messages.

… and before you think about using regular expressions to do this, I would not recommend it! It’s much easier to write a lexer from scratch, but I highly recommend reading this blog post from our lord and saviour Rob Pike.
There are many articles on why regex is not the right tool for the job, so I think I'll skim over that segment for this article.
It's also a lot more fun to write a lexer than it is to pull your hair out over a long-winded regular expression you have pasted into regex101.com at 5:24 in the morning.

For my first 'programming language', I used a split(str) function to tokenize my input - I didn't get very far.
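
For illustration, here's a minimal sketch of a hand-written lexer loop, building on the Token struct from earlier. The exact token categories and the scanning details are just for illustration - a real lexer would also handle symbols, strings, comments, and positions:

// minimal hand-written lexer sketch; needs <cctype>, <string>, <vector>
std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    size_t pos = 0;
    while (pos < input.size()) {
        char c = input[pos];
        if (isspace(c)) {
            pos++; // skip whitespace
        } else if (isdigit(c)) {
            size_t start = pos;
            while (pos < input.size() && isdigit(input[pos])) pos++;
            tokens.push_back(Token{input.substr(start, pos - start), Number});
        } else if (isalpha(c) || c == '_') {
            size_t start = pos;
            while (pos < input.size() && (isalnum(input[pos]) || input[pos] == '_')) pos++;
            tokens.push_back(Token{input.substr(start, pos - start), Identifier});
        } else {
            pos++; // symbols, string literals, comments, etc. would be handled here
        }
    }
    return tokens;
}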

Parsing #

Parsing is a bit more of a complicated beast than lexical analysis. There are many kinds of parsers and parser generators out there. This is where things start to get a bit more serious.

In compilers, a parser will usually take an input of tokens and produce a tree of some sort. This could be an 'Abstract Syntax Tree' or a 'Parse Tree'. Both are similar at their core, but they do have their differences.

You could think of these stages so far as functions:

fn lex(string input) []Token {...}
fn parse(tokens []Token) AST {...}

let input = "int main() { return 0; }";
let tokens = lex(input);
let parse_tree = parse(tokens);
// ....

Compilers are usually built up from lots of little components that take inputs and mutate them or convert them into different outputs. This is partly why functional languages are very well suited to creating compilers - that, and the sweet pattern matching and usually very extensive standard libraries.
Fun fact: the first implementation of the Rust compiler was written in OCaml.
A piece of advice is to keep these components as self-contained and simple as possible; keeping everything modular simplifies the entire process. I feel this philosophy applies to many other aspects of software engineering too.

Trees! #

Parse Tree #

WTF is… a parse tree? A parse tree, sometimes referred to as a 'concrete syntax tree', is a much denser tree that represents the source program. It contains all (or most) of the information of the input program, usually matching what is described in the grammar of your language. Each node in the tree is a nonterminal or terminal in the grammar, e.g. a NumberConstant node or a StringConstant node.

Abstract Syntax Tree #

The Abstract Syntax Tree (AST) is, as the name suggests, an ‘abstract’ syntax tree. The Parse Tree contains a lot of (possibly superfluous) information about your program. The point of the AST is that we don’t need all of this information to do our job. It throws away a lot of the useless structural/grammatical information that doesn’t contribute to the semantics of the program.
For example, perhaps you have an expression in your Parse Tree like ((5 + 5) - 3) + 2. The Parse Tree would store the parentheses, and maybe the fact that the values 5, 5, 3, and 2 are atoms, but once the associations have been derived, these details can be abstracted away in the AST: all we need to know is the values (the numbers), their operators, and the order of the operations.

Here’s another free for re-use image I found that shows the AST for a + b / c. It’s a bit large, sorry! I blame svbtle for that.

a + b / c in the form of a tree

An AST could be represented as such:

interface Node { ... };
interface Expression : Node { ... };

struct UnaryExpression : Expression {
    Expression value;
    char op;
};

struct BinaryExpression : Expression {
    Expression lhand, rhand;
    string op; // string because the op could be more than 1 char.
};

// or for something like a variable declaration
struct Variable : Node {
    Token identifier;
    Expression value;
};

This is a very limited representation of an AST, but you could hopefully see how you would structure your nodes.

As for parsing them, you could have a procedure like:

Node parseNode() {
    Token current = consume();
    switch (current.lexeme) {
    case "var":
        return parseVariableNode();
    // ...
    }
    panic("unrecognized input!");
}

Node n = parseNode();
if (n != null) {
    // append to some list of top level nodes?
    // or append to a block of nodes!
}
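
As a rough sketch, parseVariableNode might (in the same hand-wavy pseudocode) expect an identifier, an '=', and then recurse into an expression parser. The expect and parseExpression helpers here are hypothetical:

// sketch: parses something like 'var name = expression;'
Node parseVariableNode() {
    Token identifier = expect(Identifier); // error out if the next token isn't an identifier
    expect("=");
    Expression value = parseExpression();  // recurse into the expression parser
    expect(";");
    return Variable{identifier, value};
}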

And hopefully you get the gist of how it would recursively parse other nodes from the top level constructs in the language. Though I’ll leave that to you to learn about the specifics of implementing a recursive descent parser.

Grammars #

To parse from a set of tokens into an AST can be a tricky task. Usually you would start with some kind of grammar for your language.

A grammar is basically a definition of how a language is structured. There are a few languages for defining languages, which can even be described (or bootstrapped) with themselves.

Extended Backus-Naur Form (EBNF) is an example of a language for defining languages. It is based on BNF, which is a bit more angle-bracket-y.

Here’s an example of some EBNF taken from the Wikipedia Article:

digit excluding zero = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
digit                = "0" | digit excluding zero ;

Production rules are defined, which tell the reader what pattern of terminals makes up what 'nonterminal'. Terminals are part of the grammar's alphabet, e.g. the token "if", or in the example above "0" and "1", are terminals. Nonterminals are the opposite: they are on the left of the production rules, and can be considered variables or 'named references' to a grouping of terminals and nonterminals.

Many languages have specifications, which contain their grammars that you can read. Here is the spec for Go, and Rust, as well as D.

Recursive Descent Parsing #

The easiest approach is using a ‘recursive descent parser’. There are many approaches to parsing, and this is one of them.

Recursive descent is a top-down parser built from a set of recursive procedures. It's a much simpler way to write a parser, given that your grammar has no left recursion. For most hobby/toy languages, it's a sufficient technique for parsing. GCC uses a hand-written recursive descent parser, though it has used YACC before.
Though there can be issues with parsing some languages this way, especially something like C, where:

foo * bar

Could be interpreted as:

int foo = 3;
int bar = 4;
foo * bar; // unused expression

Or it could be interpreted as:

typedef struct {
    int b;
} foo;
foo* bar;
bar->b = 3;

Clang also uses a recursive-descent parser in its implementation:

Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straightforward to implement excellent diagnostics and error recovery.

A few alternative approaches to parsing that are worth reading into are parser combinators, Pratt (top-down operator precedence) parsing, and bottom-up LR/LALR parsing (usually produced by a parser generator).

Parser Generators! #

Parser generators are a very good approach to take too. There are trade offs though, as there are with any choice you make when creating a piece of software.

Parser generators are usually very fast, and they are a lot easier than writing your own parser (and getting a performant result out of it). On the other hand, they are usually not very user friendly and don't allow for great error messages. In addition, you have to learn how to use your parser generator, and when it comes to bootstrapping your compiler, you will probably have to bootstrap your parser generator too.

ANTLR is an example of a parser generator, though there are plenty more out there.

I think they are a tool for when you don’t want to focus on writing the frontend, but would rather get on with writing the middle and the backend parts of your compiler/interpreter, or dealing with whatever else you want to parse.

The Application of Parsing #

If you haven't guessed yet, just the frontend of a compiler (lexing and parsing) is applicable to plenty of other problems: syntax highlighting, linters and code formatters, configuration files and data formats, query languages, template engines, and so on.

And much more.

The Middle #

Semantic analysis! Analyzing the semantics of the language is one of the harder parts of compiler design.
It involves ensuring that input programs are correct. My language Krug is currently lacking in semantic analysis, and without it the programmer is trusted to write code that is correct all of the time. In reality that is never the case: we're always writing, compiling, (maybe running), and fixing errors in a loop.
Not only that, but it can be impossible to compile some programs at all if you can't verify in this analysis phase that their semantics are correct.

I remember reading a diagram a while ago that showed how the work in a compiler was split between the front, middle, and back ends, and it was something along the lines of…

F: 20% M: 20% B: 60%

But nowadays I feel it’s more like this:

F: 5% M: 60% B: 35%

Frontends are mostly the job of a generator, or can be written very quickly for a language that is mostly context-free and/or doesn't have any ambiguities in its grammar (throw a recursive descent parser at it!).

With technology like LLVM, most of the work of optimisation can be offloaded onto the framework, already providing a plethora of optimisations out of the box!
So that leaves us with semantic analysis, which is a very integral part of the compilation phase.
For example, in a language like Rust with its ownership memory model, the compiler mostly acts as a big fancy machine that performs all kinds of static analysis on the input forms. Part of the job is converting the input into a more manageable form on which to do this analysis.

Because of this, semantic analysis is definitely eating up a lot of the focus in compiler architecture, since a lot of the tedious groundwork, like optimising generated assembly or reading input into an AST, is done for you.

Semantic Passes #

Most compilers will perform a large amount of ‘semantic passes’ over the AST (or some abstract form to represent the code) during semantic analysis. This article goes into detail about most of the passes that are performed in the .NET C# compiler (from 2010).

I won’t go over every pass that could be implemented, especially since this varies on the language you are implementing for, but below are a few passes that are in my language Krug.

‘Top Level’ Declaration Pass #

The compiler will go over all of the 'top level' declarations in the modules and acknowledge that they exist. It's 'top level' because it doesn't go any further into blocks; it simply declares what structures, functions, etc. exist in which module.

Name/Symbol Resolution #

This pass will go through all of the blocks of code in the functions, etc. and resolve the names in them, i.e. find the symbols that they refer to. This is a common pass, and it is usually where the error 'No such symbol XYZ' comes from when you compile your Go code.

This can be a very tricky pass to do, especially if you have cyclic dependencies in your dependency graph. Some languages do not allow for cycles, e.g. Go will error if you have packages that form a cycle. As does my language Krug! Cyclic dependencies are considered to be a side effect of poor design.

Cycles can be detected with a modified DFS over the dependency graph, or you can use Tarjan's strongly connected components algorithm (like Krug does) to identify (multiple) cycles in the graph.
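
Here's a sketch of the DFS approach, assuming modules are numbered and deps[m] lists the modules that module m imports: a module is marked as 'in progress' while its imports are being visited, and reaching an 'in progress' module again means we've walked around a cycle.

enum class VisitState { Unvisited, InProgress, Done };

// returns true if a cycle is reachable from 'module'; needs <vector>
bool hasCycle(int module, const std::vector<std::vector<int>>& deps,
              std::vector<VisitState>& state) {
    if (state[module] == VisitState::InProgress) return true;  // back edge: cycle!
    if (state[module] == VisitState::Done) return false;       // already checked
    state[module] = VisitState::InProgress;
    for (int dep : deps[module]) {
        if (hasCycle(dep, deps, state)) return true;
    }
    state[module] = VisitState::Done;
    return false;
}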

Type Inference #

There is a pass in the compiler that will go through all of the variables and infer their types. The type inference in Krug is very weak: it simply infers a variable's type from the value it is assigned. It is by no means a fancy type system like the one you would find in a functional language like Haskell.

Type inference can be done using a process called ‘unification’, or ‘type unification’. Though you can have some very simple implementations for simpler type systems.

Types in Krug are implemented as such:

interface Type {};

struct IntegerType : Type {
    int width;
    bool signed;
};

struct FloatingType : Type {
    int width;
};

struct ArrayType : Type {
    Type base_type;
    uint64 length;
};

And you could have some simple type inference where you assign expression nodes a type, e.g. an IntegerConstantNode would have the type IntegerType(64). Then you have a function unify(t1, t2) which picks the widest type, which can be used for inferring the types of more complex expressions like a BinaryExpression. From there it's a matter of assigning the left-hand variable the right-hand value's inferred type.
A while back I wrote a simple type inferrer in Go that prototyped how Krug's inference would be implemented.
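
As a sketch (not how Krug actually implements it), a naive unify over the types above could just pick the 'widest' type. The isFloating, isSigned, and width helpers here are hypothetical functions that inspect a Type:

// sketch: pick the widest type of the two operands of e.g. a BinaryExpression
Type unify(Type t1, Type t2) {
    // floats 'win' over integers, so 1 + 2.5 is inferred as a float
    if (isFloating(t1) || isFloating(t2)) {
        return FloatingType{ max(width(t1), width(t2)) };
    }
    return IntegerType{ max(width(t1), width(t2)), isSigned(t1) || isSigned(t2) };
}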

Mutability Pass #

Krug is (like Rust) an immutable by default language, meaning that variables are constant unless specified otherwise:

let x = 3;
x = 4; // BAD!

mut y = 5;
y = 6; // OK!

This pass in the compiler will run through all of the blocks and functions and ensure that they are ‘const correct’, i.e. we are not mutating anything we shouldn’t, and that all values passed to certain functions are constant or mutable where need be.

This is done with symbol information that is collected from prior passes. A symbol table is built up in the semantic pass which contains information like the token name, and whether the variable is mutable or not. It could also contain other information, e.g. in the case of C++, if the symbol is extern or static.

Symbol Tables #

A symbol table, or 'stab', is a lookup table for the symbols that exist in your program. These exist per scope and contain all of the symbol information for that scope.
The symbol information contains properties like the name of the symbol; it could also contain type information, whether the symbol is mutable, whether it should be externally linked, whether it lives in static memory, and so on.
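
As a sketch, a symbol table can be little more than a map from names to that collected information (the exact fields here are just illustrative):

struct SymbolInfo {
    std::string name;
    Type type;         // filled in during/used by type inference
    bool is_mutable;   // used by e.g. the mutability pass
    // linkage, storage class, etc. could also live here
};

struct SymbolTable {
    std::unordered_map<std::string, SymbolInfo> entries; // needs <unordered_map>

    // returns null if the name isn't declared in *this* scope
    SymbolInfo* find(const std::string& name) {
        auto it = entries.find(name);
        return it == entries.end() ? nullptr : &it->second;
    }
};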

Scope #

Scope is an important concept in programming languages. Of course, your language doesn't necessarily have to allow for nested scope - it could all be in one namespace!
Representing scope is an interesting problem in compiler design, though: scope behaves like (or is) a stack data structure in most C-like languages.
Usually you push and pop scopes, and these control which names are visible, e.g. allowing for shadowing of variables:

{ // push scope
    let x = 3;
    { // push scope
        let x = 4; // OK!
    } // pop scope
} // pop scope

And this can be represented in a few ways:

struct Scope {
    Scope* outer;
    SymbolTable symbols;
};
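
Name lookup then walks outwards through the enclosing scopes until the symbol is found (or we run out of scopes). A sketch, building on the Scope struct above and the SymbolTable sketch from earlier:

// sketch: resolve a name starting from the innermost scope
SymbolInfo* resolve(Scope* scope, const std::string& name) {
    for (Scope* s = scope; s != nullptr; s = s->outer) {
        if (SymbolInfo* sym = s->symbols.find(name)) {
            return sym; // found in this (or an enclosing) scope
        }
    }
    return nullptr; // 'no such symbol' error in the resolution pass
}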

Somewhat irrelevant, but interesting reading: the spaghetti stack, a data structure that can be used to store scopes alongside their counterpart block AST nodes.

Type Systems #

Many of these headings could be their own articles, but I feel like this heading probably takes the cake for that.
There is a lot of information on type systems out there, many kinds of type systems, and a lot of heated debate about all of it. I won't go into the topic too deeply here, but I will link to this excellent article by Steve Klabnik.
The point of this header is that the type system is something that is defined and enforced semantically in the middle of the compiler, with the aid of the compiler's representations and the analysis of those representations.

Ownership #

Ownership is a concept that is becoming more and more prevalent in the programming world. Ownership and move semantics are principles in the language Rust — and hopefully more to come. There are many forms of static analysis performed on Rust code that checks that input conforms to a set of rules with regards to memory: who owns what memory, when the memory dies, and how many references (or borrows) to these values/memory there are.

The beauty of Rust is that this is all enforced at compile time, during the middle of the compiler, so there is no garbage collection or reference counting forced upon the programmer. These semantics are offset to the type system and can be enforced before the program even exists as a compiled binary.

I can’t speak on the internals of how this all works, but I can tell you that it is the work of static analysis and some cool research by the folks at Mozilla and the people behind Cyclone.

Control Flow Graphs #

To represent a program's flow, we use a Control Flow Graph (CFG), which contains all of the paths that may be traversed during the execution of the program. This is used in semantic analysis for things like dead code elimination, e.g. blocks, functions, or even modules that won't ever be reached. It can also be used to determine whether a loop will never stop iterating, or to spot unreachable code, e.g. when you call a panic or return inside a loop and the code after the loop never gets to execute. Data flow analysis plays a prominent role during the semantic phase of a compiler, and it's worth reading up on the kinds of analysis you can do, how they work, and what optimisations can come from them.
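
A CFG is usually built out of 'basic blocks' (straight-line runs of instructions with no jumps into or out of the middle), connected by edges showing where control can flow. A rough sketch of the data structure (the Instruction type is assumed) might be:

struct BasicBlock {
    std::vector<Instruction> instructions; // straight-line code, no branches in the middle
    std::vector<BasicBlock*> successors;   // blocks control can jump to from here
    std::vector<BasicBlock*> predecessors; // blocks that can jump here
};

// any block unreachable from the entry block (via a simple graph traversal)
// is dead code and can be reported or thrown away.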

Backend #

an image of a barren wasteland

The final part in our architecture diagram.

This is where most of the work is done to produce our binary executable. There are a few ways to do this, which we will discuss in the later segments of this article.

The semantic analysis phase doesn't necessarily have to mutate much of the information in the tree, and it's probably a better idea not to, to avoid ending up with a spaghetti mess.

A note on transpilers #

Transpilers are another form of compiler, in which the compiler transpiles into another 'source level' language, e.g. you could write something that compiles into C code. I think this is somewhat pointless, though, if your language doesn't have a lot to offer on top of the language it's compiling to. It mostly seems to make sense for languages that are relatively high level, or where the language itself is fairly limited.
However, compiling to C is a well-established habit in the history of compilers; in fact, the first C++ compiler, 'Cfront', compiled into C code.

A good example for this is JavaScript. TypeScript (and many other languages) transpile into JavaScript to introduce more features to the language, and most importantly a sensible type system with various amounts of static analysis to catch bugs and errors before we encounter them at runtime.

This is one type of 'target' for a compiler, and it's usually the easiest, as you don't have to think in lower-level terms about allocating variables or handling optimisations, etc.; you are mostly piggybacking on top of another language. The obvious downside is that you have a lot of overhead and are usually confined by the language you are compiling to.

LLVM #

Many modern compilers will opt for using LLVM as their backend: Rust, Swift, C/C++ (clang), D, Haskell.
This can be considered the 'easier route', as most of the work is done for you: support for a wide variety of architectures, and a level of optimisation that is very hard to beat. In contrast to the aforementioned route of transpilation, you also get quite a lot of control with LLVM, more so than if you were to compile to C. For example, you can decide how large types will be, e.g. 1 bit, 4 bits, 8 bits, 16 bits, which isn't as easy in C, and is sometimes not possible or not even defined for certain platforms.

Generating Assembly #

Generating code directly for a specific architecture, i.e. machine code or assembly, is technically the most common route, with countless languages having opted for it.

Go is an example of a modern language that does not take advantage of the LLVM framework (as of writing this). It generates code for a number of platforms and architectures, with Windows, Linux, and macOS among the supported platforms. And a fun fact: my language Krug generated assembly directly in an earlier prototype version of the language.

There are lots of pros and cons to this, though nowadays, with technology like LLVM available, it's unwise to generate assembly code yourself, as it is unlikely that a toy compiler with its own assembly backend would surpass LLVM's level of optimisation for one platform, let alone multiple.

That being said, a considerable benefit of generating assembly yourself is that your compiler will likely be a lot faster than if you were to use a framework like LLVM where it has to build your IR, optimise it, etc. and then eventually write it out as assembly (or whatever target you pick).

Regardless, it’s still enjoyable to attempt. And is especially interesting if you wanted to learn more about programming in assembly or the lower levels of how languages work. The easiest way to approach this is to walk the AST, or walk the generated IR (if you have one) and ‘emit’ assembly instructions to a file using fprintf or some file writer utility. This is how 8cc works.
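For a flavour of what 'walk and emit' looks like, a backend for a program as small as int main() { return 0; } could be little more than this sketch (x86-64, AT&T syntax):

// sketch: emit a function that just returns an integer constant; needs <cstdio>, <string>
void emitFunction(FILE* out, const std::string& name, int return_value) {
    fprintf(out, ".globl %s\n", name.c_str());
    fprintf(out, "%s:\n", name.c_str());
    fprintf(out, "    mov $%d, %%eax\n", return_value); // the return value lives in eax
    fprintf(out, "    ret\n");
}

// emitFunction(out, "main", 0); for the 'int main() { return 0; }' example from earlier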

Bytecode Generation #

Another option is generating bytecode for some kind of virtual machine or a bytecode interpreter. Java is a prime example of this, in fact the JVM has spawned an entire family of languages that generate bytecode for it, e.g. Kotlin.

There are many benefits to generating bytecode; for Java, the main one was portability. If your virtual machine runs anywhere, any code that executes on the virtual machine will run anywhere too. And it's a lot simpler to make an abstract set of bytecode instructions run on many machines than it is to target 50 bajillion computer architectures yourself.
As far as I know, the JVM will also JIT-compile frequently run 'hot' code into native functions, among other JIT tricks, to squeeze extra performance out of the code.
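
At its core, a bytecode interpreter is just a loop over an array of instructions with a switch on each opcode, usually operating on a stack or virtual registers. A toy sketch (nothing like the JVM's actual instruction set):

enum Opcode { PUSH, ADD, PRINT, HALT };

// sketch of a tiny stack-based interpreter; needs <cstdio>, <vector>
void interpret(const std::vector<int>& code) {
    std::vector<int> stack;
    size_t pc = 0; // program counter
    while (pc < code.size()) {
        switch (code[pc++]) {
        case PUSH: stack.push_back(code[pc++]); break; // the operand follows the opcode
        case ADD: {
            int b = stack.back(); stack.pop_back();
            int a = stack.back(); stack.pop_back();
            stack.push_back(a + b);
            break;
        }
        case PRINT: printf("%d\n", stack.back()); break;
        case HALT: return;
        }
    }
}

// interpret({PUSH, 2, PUSH, 3, ADD, PRINT, HALT}); // prints 5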

Optimisations #

Optimisations are integral to a compiler; no one wants slow code! They will usually be the larger part of the backend, and there is a huge amount of research on squeezing the extra bits of performance out of code.
If you ever compile some C code with all optimisations cranked up, it can be amazing what kind of madness the compiler produces. godbolt's Compiler Explorer is a great tool for looking into how existing compilers generate their code and which instructions relate to which source code, and it lets you specify optimisation levels, targets, compiler versions, etc.

A good start if you are ever writing a compiler is to write simple programs in C and turn off all optimisations, as well as strip the debug symbols, and have a look at what code GCC generates. It can be a handy reference if you ever get stuck!

One thing to keep in mind with optimisations is that some of them trade accuracy of the program for speed (fast floating point maths is the classic example), and finding the right balance can be difficult. Some optimisations are also very specific to their use case and can in some instances produce the wrong result; those usually don't find themselves in production compilers for obvious reasons!

An interesting comment taken from the HN thread on this article from user ‘rwmj’ is that you only need around 8 optimisation passes to get 80% of the best case performance from your compiler, all of which were catalogued in 1971! This was in a compilers talk from Graydon Hoare, the mastermind behind Rust.

IR #

Having an intermediate representation (IR) is not required, but definitely beneficial. You could generate code from the AST, though it can become quite tedious and messy to do so, as well as it’s quite difficult to optimise.

An IR can be thought of as a higher level representation of the code that you are generating for. It must be very accurate to what it represents, and must contain all of the information necessary to generate the code.

There are certain types of IR, or ‘forms’ you can make with the IR to allow for easier optimisations. One example of this is SSA, or Static Single Assignment, in which every variable is assigned exactly once.
Go builds an SSA based IR before it generates code. LLVM’s IR is built upon the concept of SSA to provide its optimisations.
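
As a small illustration of the idea, every re-assignment gets a fresh 'version' of the variable, and a φ (phi) node picks the right version where control flow merges. Shown here as comments on some C-ish code (cond and use are just stand-ins):

int x = 1;       // x1 = 1
x = x + 2;       // x2 = x1 + 2
if (cond) {
    x = 5;       // x3 = 5
}
use(x);          // x4 = φ(x2, x3); use(x4)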

SSA makes a few optimisations much easier by nature, for example constant propagation and dead code elimination, and it also helps a lot with (a big one) register allocation.

Register Allocation #

Register allocation is not a necessity when it comes to generating code, but an optimisation. One abstraction we take for granted is that we can define as many variables as our programs require. In assembly, however, we can either make use of the finite number of registers available (usually 16 to 32) and keep track of them in our heads, or we can spill values to the stack.

Register allocation is an attempt to work out which variables can live in which registers at which points in time (without overwriting values that are still needed). This is much more efficient than spilling to the stack, though it can be quite expensive, and it is impossible in general for a computer to calculate the perfect solution.
A few algorithms for register allocation are graph colouring, which is a computationally hard problem (NP-complete), and linear scan, which scans the variables to determine their liveness ranges, as opposed to graph colouring, which requires the code to be in graph form to calculate the liveness of the variables.

Things to consider #

There is a vast amount of information on compilers. So much to cover that it would not fit nicely into this article. That being said, I wanted to write up, or at least mention, a few bits and pieces that should be considered for any of your future endeavours.

Name Mangling #

If you are generating assembly where there isn’t really any scope or namespace, you will have an issue with conflicting symbols in a lot of cases. Especially if your language supports function overloading or classes, etc.

fn main() int {
    let x = 0;
    {
        let x = 0;
        {
            let x = 0;
        }
    }
    return 0;
}

For instance, in the previous example (assuming the variables don't get optimised out anyway 😉) you would have to mangle the names of those symbols so that they do not conflict in the generated assembly. Mangled names usually encode type information too, and they can contain scope information, etc.
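
For example, one (completely made up) scheme could bake the enclosing scope and the type into the emitted symbol name, so the three x variables above might come out as something like:

main_x0_int     // x in the outermost block of main (hypothetical mangled names)
main_x1_int     // x in the first nested block
main_x2_int     // x in the innermost block

C++ compilers do something similar, though far more involved, to encode namespaces, overloads, and template arguments into symbol names.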

Debug Information #

Debuggers like GDB and LLDB work with debug information standards like DWARF. Another excellent feature of LLVM is that you get relatively easy integration with these existing debuggers via DWARF. Your language will probably need a debugger, and it's always easiest to use someone else's rather than rolling your own.

Foreign Function Interface (FFI) #

There is usually no escape from libc; you should probably read up on this and think about how you would incorporate it into your language. How will you hook into C code, or expose your language's code to C?

Linking #

Writing your own linker is a task of its own. When your compiler generates code, does it emit textual assembly (i.e. a .s/.asm file that is then run through an assembler)? Does it write machine code directly into an object file? Jonathan Blow's programming language Jai supposedly writes all of the code into a single object file. There are many different approaches here, with varying trade-offs.

Compiler-as-a-Service (CAAS) #

This is where the compiler phases that have been discussed in this post are cut up into API routes, which means that a text editor could ask the Krug server to tokenize a file and get back a response containing the tokens that it produced. In addition, all of the static analysis routes are exposed, so tooling becomes a breeze.

There are obvious trade-offs, like the latency of sending files back and forth, and a lot of the compiler's design has to be thought of differently to work in the context of API routes.

Not a lot of production compilers seem to approach compiler creation with this method. One that comes to mind is Microsoft's 'Roslyn', though admittedly I do not know a lot about Roslyn, so I'll leave that for you to look into. I could be wrong here, but a lot of compilers do seem to be heading in this direction by writing an API layer that hooks into the existing compiler, e.g. Rust has the RLS.

My own language, Krug - which is still under heavy development and is prone to breaking - uses this architecture for its compiler ‘Caasper’.
The compiler itself, 'caasper', runs locally on your machine (or on a server if you want it to), and the frontends or clients, e.g. 'krug', communicate to and from this compiler service.

One benefit of this is that you can have many frontend implementations, and a single frontend could be bootstrapped into the language itself before re-writing the entire compiler.

The frontend for Krug is implemented in JavaScript, though there will be alternative frontends implemented in Go*, as well as a frontend hopefully written in Krug itself. JavaScript was chosen as it is quite an accessible language, and the frontend can be installed with the very popular package managers yarn/npm.

*The initial frontend was written in Go, and was (unsurprisingly) significantly faster than the JS frontend.

The source code for Caasper - the compiler itself - can be found here if you wanted to take a look.
In addition, a prototype of Krug is available on my personal GitHub here, it’s written in D and compiles down to LLVM. And finally, there is a demo of it on my YouTube channel here.

You can read the (work in progress) Krug tutorial here.

Hello! #

Thanks for making it this far. I hope you enjoyed the article. Feel free to follow me on Twitter @Felix_Angell.

I will likely be updating this article as the time passes with more information or edits here and there, and I’ll likely Tweet when I do so.

Feel free to shoot me an email at [email protected] with any edits, corrections, or requests for what I should write next!
