GotaGuide

Various tech guides for various things.

View on GitHub

How To Make A Programming Language

Ever wanted to create your own programming language?

Table Of Contents

Introduction

Hello, this is Gota7. Before we have been doing some setup and some learning, but now it is time to actually do the work. Well, I sure hope you did the setup and learning in the previous pages. If not, slap yourself in the face and go read them now, not skim them, read and understand them as it will help you here. It will benefit you to try and practice making a simple language with me first in order to make something more complicated in the end. For your sake, there will be no final code for you to copy paste. :}

So Uh, How Do We Begin?

Great question! Let’s first take a look at some sample Toylet code from earier:

func doThing(number num, string str, decimal dec) {
    number a = num + 5 * 3 - 1;
    a += (number)dec;
    println(str + a);
}

doThing(3, "Result: ", 5.7);

As you can see, there are a lot of moving parts here. But before we can even think about implementing anything, we have to get a basic grammar that runs at the very least. Put the following in Toylet.g4:

// Name of our grammar.
grammar Toylet;

// This is our starting point.
init: statement* ;

// Represents a generic statement.
statement
    :   'TODO'  // We need at least something to make this grammar valid.
    ;

Now make sure Test.toy is blank. Simply run the run script with either Test.toy as a parameter, or with no parameters and hitting Ctrl+D after it starts (normally you enter text you want to test before hitting Ctrl+D, but we have none in this case). When ever I say to run an input, this is what you will be doing. There will be no more grammar files except for Toylet.g4, but you can have as many test .toy files as you wish. If you want your folder to be cleared from the garbage ANTLR makes, run the clean script. Anyways, you should see this output:

alt text

I mean what did you expect? We didn’t give it any input, or any true rules. Just the bare minimum required in order to get the language running.

Baby’s First Step

Whitespace is stupid, we don’t like it in Toylet. It will interfere with interpretting, and is not needed as we use brackets and semicolons. Your first assignment is to make a WHITESPACE Token that uses -> skip to skip over whitespace. Can you also figure out how to implement C style // and /* */ comments (LINE_COMMENT and COMMENT respectively)? Hints: . will act like a character in ANTLR that will match anything, you can use operators on it too. You also have to make sure a comment takes precedence over whitespace (as it has a newline in it), remember how to make something match first over another thing? Also note that \r is used sometimes, but not always. ~ followed by a character will treat that character as a noninclusive ending point.

Test Bench

Go ahead and run the below input into your grammar (either by putting it in a .toy file to use as a parameter for the run script, or by entering it in manually after doing the run script with no parameters and hitting Ctrl+D).

// Hello World!
/*
	Nothing!
*/

If you test some properly formatted comment inputs such as above, you should get absolutely nothing! There should be no warnings or errors if you did it right.

Solution

WHITESPACE WHITESPACE: (' ' | '\r' | '\n' | '\t') -> skip; // There are probably more characters in the ASCII table, but these are the basic ones.
LINE_COMMENT LINE_COMMENT: '//' ~('\r' | '\n') -> skip; // This ends a comment when it runs into either a newline or a carriage return.
COMMENT COMMENT: '/*' .*? '*/' -> skip; // Even simpler than the other comment type, just matches anything between the multiline comment markers. A maybe operator after the any amount operator makes this non-greedy.
Token Ordering While the order of LINE_COMMENT and COMMENT doesn't matter, they both should be over WHITESPACE. This is so a character belonging to WHITESPACE isn't matched by WHITESPACE before it can get to a COMMENT. All of these should be below the pre-existing Rules of course!

Now I hope you at least tried to implement them without looking at the solutions first. If you didn’t try hard enough, you are cheating yourself. Hopefully you got the hang of stuff by now.

Numbers, Decimals, And Strings, Oh My!

This is a fun challenge to see what you can do. This one will get incrementally harder. Recall that Toylet only supports 3 data types: signed numbers, decimal numbers, and strings. We need to define the following items:

Hints: Remember that Fragments are your friends here, and you can define as many Fragments as you like. Also be sure to not use any Token definition in another Token definition. It may be helpful to define a Fragment for each of the different number bases, and a Fragments specifically for character escape sequences. Also note that \\ is how you define a backslash in ANTLR4.

Test Bench

Keep in mind that you can comment out lines that will not work with your grammar yet! Run the following input:

1234567890;
-0d;
12.;
-.12;
1.23;
123.4;
-17;
"Hello World!";
"Hi\n";
"\tVery   \r\nSpacey\\Backslash\0Null\xFFHexNumber";

You should never have any warnings or errors. Your output tree should look like this:

alt text

Solutions

I’m warning you, you are cheating yourself if you are blatantly copying and pasting these. If you are stuck, go through one solution (do them in order) and see if it helps you solve the other parts. Obviously put the Rules above any Tokens or Fragments, and Tokens over any Fragments.

statement // Simply just call our new decimal, number, and string Rules. You have to make sure decimal is in front of number though, so that way it will detect the d suffix. statement : decimal | number | string ;
decimal // Just reference our Token and a semicolon, not much going on here. decimal : DECIMAL ';' ;
number // Just reference our Token and a semicolon, not much going on here. number : NUMBER ';' ;
string // Just reference our Token and a semicolon, not much going on here. string : STRING ';' ;
Fragments For DECIMAL // Just define digits we are allowed to use. fragment DecDigit: [0-9];
More Fragments For NUMBER // Just define some prefixes and other number types a NUMBER could use. Remember that a Fragment can call other ones. Make sure to put the HexDigit above the DecDigit since it uses it. fragment HexDigit: DecDigit | [a-f] | [A-F]; fragment BinDigit: [0-1]; fragment HexPrefix: '0x' | '0X'; fragment BinPrefix: '0b' | '0B';
Fragments For STRING // All the STRING needs is an escape sequence Fragment. Make sure to put this above the Fragments for digit types! Using other Fragments is allowed. fragment EscapeSequence : '\\r' | '\\n' | '\\t' | '\\a' | '\\b' | '\\v' | '\\\\' | '\\"' | '\\\'' | '\\\\x' HexDigit HexDigit ;
NUMBER // You should obviously be defining these tokens above the fragments. This is pretty self-explanatory, just make sure there is at least one digit after the correct prefix, and allow the possibility for negatives. NUMBER : '-'? DecDigit+ | '-'? HexPrefix HexDigit+ | '-'? BinPrefix BinDigit+ ;
DECIMAL // Make sure that a regular number has a suffix after it, and that any possibility can be potentially negative. The way this is defined means there has to be at least one digit before or after the decimal point. Remember that we can only use Fragments and not other Tokens. DECIMAL : '-'? DecDigit+ ('d' | 'D') | '-'? DecDigit+ '.' DecDigit* ('d' | 'D')? | '-'? DecDigit* '.' DecDigit+ ('d' | 'D')? ;
STRING // Match any amount of characters inbetween quotations. Make sure an escape sequence is matched first instead of any item. A maybe operator after the any operator makes this non-greedy. STRING : '"' (EscapeSequence | .)*? '"' ;

Hopefully you were able to figure out a solution on your own or were close to one of my solutions. If you didn’t try, shame on you. Next we will work on variables.

Variables

At the moment, there are only 3 data types: Numbers, Decimals, and Strings. This means that you are only allowed to declare a variable belonging to one of these 3 types. In this next portion, we will define the rules for variable names, declaring variables, and assignment to them. You can delete everything in the statement rule, as well as the number, decimal, and string rules as those were only for testing.

Hints: You should probably define a Fragment that can represent the first character of a string, and another that can represent another character. You should also define 3 different Rules for each of the different expressions that a data type can have. Remember that operator precedence is given by what is on top. Also make sure to keep the is a mindset, as a number expression can be a decimal expression, and both could be in a string expression. Also keep in mind that a variable declaration could either be just declaring a variable or also defining it too.

Test Bench

Remember that you can comment out lines that have not been implemented yet! Run the following test input:

number one1 = 1;
number _is_2;
string @hello = 7.3 + " Hello " + " World " + (one1 + "!");
decimal dec = 7 % (3 - 1) * 2; // Notice how this one is an int expression, but can still be assigned to a decimal value.
number mystery = (number)(3.3d / 19) ^ 5;

You should never have any warnings or errors. Your output tree should look like this:

alt text

Solutions

If you are stuck, go through one solution (do them in order) and see if it helps you solve the other parts. Obviously put the Rules above any Tokens or Fragments, and Tokens over any Fragments. This section was probably one of the most confusing ones, but hopefully you were able to figure out most of it! Keep in mind that order matters with these, so anything that depends on another thing should be above the said thing.

statement // Pretty self explanatory, given all we can do is declare a variable. statement : variable_declaration ;
variable_declaration // We can either just say a variable exists, or actually assign it a value. variable_declaration : variable_parameter OP_ASSIGN expression ';' | variable_parameter ';' ;
variable_parameter // When passing a parameter, we could only have a defined type and the name of it. variable_parameter : variable_type IDENTIFIER ;
Basic Tokens // Operators. OP_ASSIGN: '='; OP_MUL: '*'; OP_DIV: '/'; OP_MOD: '%'; OP_ADD: '+'; OP_SUB: '-'; OP_NOT: '~'; OP_AND: '&'; OP_OR: '|'; OP_XOR: '^'; // Primitives. PRIMITIVE_NUMBER: 'number'; PRIMITIVE_DECIMAL: 'decimal'; PRIMITIVE_STRING: 'string';
variable_type // A type can only be one of our 3 main types. variable_type : PRIMITIVE_NUMBER | PRIMITIVE_DECIMAL | PRIMITIVE_STRING ;
expression // Order is important here. We know one higher on this list could fit into another, so we don't want all expressions to be classified as string ones. expression : number_expression | decimal_expression | string_expression ;
Identifier Fragments // Some basic rules on how we can compose an identifier. fragment IdentifierStart: '@' | '_' | [a-z] | [A-Z]; fragment IdentifierOther: '_' | [a-z] | [A-Z] | DecDigit;
IDENTIFIER // An identifier for a variable or function name. Not much of interest here, should be fairly straightforward. IDENTIFIER : IdentifierStart IdentifierOther* ;
number_expression // This is the most complicated one of them all. Notice how it follows the order of operations, and that ones with equal prescedence are on the same line. Also notice how you are allowed to say a variable or number alone can make this expression. Casts are also satisfied. number_expression : '(' number_expression ')' | '(' PRIMITIVE_NUMBER ')' expression | OP_NOT number_expression | number_expression (OP_MUL | OP_DIV | OP_MOD) number_expression | number_expression (OP_ADD | OP_SUB) number_expression | number_expression OP_AND number_expression | number_expression OP_XOR number_expression | number_expression OP_OR number_expression | IDENTIFIER | NUMBER ;
decimal_expression // This is pretty close to the number expression, except for the fact that it has less operators, and can be a number expression if it needs to be. decimal_expression : '(' decimal_expression ')' | '(' PRIMITIVE_DECIMAL ')' expression | decimal_expression (OP_MUL | OP_DIV) decimal_expression | decimal_expression (OP_ADD | OP_SUB) decimal_expression | number_expression | IDENTIFIER | DECIMAL ;
string_expression // This is one of the simpler ones. You can only add. Also notice that it can be a decimal or number expression if needed, but does not use the expression Rule. This is due to the fact doing so would cause an infinite loop of whether or not to classify something as a string expression or a generic expression. string_expression : '(' string_expression ')' | '(' PRIMITIVE_STRING ')' expression | string_expression OP_ADD string_expression | decimal_expression | number_expression | IDENTIFIER | STRING ;

Yeah, this one was pretty hard. Although once you get this one done, it should hopefully get easier from here, or not. If you were able to figure this one out without too many problems, then you are doing really well. The next part is easier as it just basically extends off of what we have just established.

Functions, Assignments, Loops, And Ifs

Now luckily stuff gets a little less tricky here, as we have the fundamental building blocks. Here is where we will be able to include function definitions and function calls. We will also start work on conditional statements, where any number but 0 is true, with 1 being the default true value.

Hint: Make sure to loop up operator presedence! For our purposes, an expression be directly responsible for comparisons.

Test Bench

Wow, we are using actual code now.

func doThing(number num, string str, decimal dec) {
    number a = num + 5 * 3 - 1;
    a += (number)dec;
    println(str + a);
}

func add(number a, number b) -> number {
    if (a > b) {
        return a + b;
    } elif (!(b <= a) && b) {
        return a + b;
    } else {
        return a + b;
    }
}

doThing(7, "Hello ", 3.3);
for (number n = 0; n < 10; n++)
    println(n);
decimal d = 4;
while (d < 12) {
    d += ((d / 3) < 5) ? 1.0 : 2.0;
}

loop {
    break;
}

TODO!!! TREE!!!

Solutions

This is the big one, where we are making the major points of our language. You may have to think critically about how to define some of these grammars.

statement // A lot of new stuff! statement : variable_declaration | variable_assignment | function_definition | function_call | if_statement | for_loop | while_loop | loop | return_statement | break_statement ;

Custom Types

Surprise! As the director, I decided at the last minute that custom types will be allowed, but very limited.

Test Bench

Not nearly as much practice code this time, this is the home stretch!

struct NameTag {
    number id = 0 * 10;
    string name = "John Doe";
}

struct Properties {
    NameTag nameTag;
    number classId;
}

Properties props;
props.nameTag.id += props.classID * 3;
println(props.nameTag.name);

You should never have any warnings or errors. Your tree will look like this:

alt text

If it matches, you did it!

Solutions

Despite this last minute addition, there actually was not much more to add. Most of it was just slight tweaks of existing rules.

statement // Represents a generic statement. statement : variable_declaration | variable_assignment | function_definition | function_call | struct_definition | if_statement | for_loop | while_loop | loop | return_statement | break_statement ;
struct_definition // All we need to do is make it so we can declare any amount of variables in a body! struct_definition : STRUCT IDENTIFIER '{' variable_declaration* '}' ;
variable_assignment // Now we can do any amount of properties. variable_assignment : IDENTIFIER ('.' IDENTIFIER)* assignment_operators expression ';' ;
variable_type // An identifier is a valid type now. variable_type : PRIMITIVE_NUMBER | PRIMITIVE_DECIMAL | PRIMITIVE_STRING | IDENTIFIER ;
Primitive Expressions // Simply replace IDENTIFIER with this for the string, number, and decimal expressions to allow accessing properties. IDENTIFIER ('.' IDENTIFIER)*

Hopefully this last part was the home stretch. It has been quite the journey developing the grammar, but it is now over.

Further Testing

Conglaturation!!! You have successfully implemented the Toylet language grammer! Now you should be able to parse any Toylet code that is syntactically valid. Try it yourself, and see the pretty trees!

Next

Next thing to do is to set up our C# project to use our ANTLR grammar, and to get ready for LLVM. If you thought this part was hard, well it only gets more confusing: Project Setup