Preface

After my first programming class in high school, I discovered the Code Golf stack exchange community. After exploring the community a bit, I discovered that many members of the code golf community either had developed their own programming languages, or were contributors to the git projects of golfing programming languages. This was a pivotal moment for my future in computer science. I was fascinated by the idea of creating something as powerful as a programming language on your own, and so began my long trek towards studying programming language development in university.

What is this project?

This post is the first in a series of blog posts detailing the creation of a compiled programming language from nothing but the Java standard library. Naturally, there are tools out there for language developers that make their jobs much, much easier. Prime examples of this are Lex and Yacc. However, this project is an exploration of all aspects surrounding building a compiler for your language, including how these tools works. I wish for this to serve as both an educational piece as well as a potential tutorial for those who wish to follow along but perhaps with their own ideas. Without further ado, let’s begin!

Planning

What is BNF?

Throughout this project, we will be using a notation to define language grammar called Backus Naur Form, or BNF for short. At its core, BNF is built on a structure of “x produces y.” Basically, if you say “x produces y” and “z produces xy”, then z is equivalent to yy, as x converts into a y. In BNF, this relationship is written as follows:

<z> ::= <x> "y"
<x> ::= "y"

Notice that y is in quotes while x and z are in angle brackets. This is because z and x may be considered in a sense to be variables, and y may be considered to be a literal. Following the lingo, x and z are called “rules” and y is called a “terminal.” For all intents and purposes, we can think of y as something similar to a base case in recursion. For example, if we have the rule:

<x> ::= "y" | <x> "y"

Then x could be “y”, or “yy”, or “yyy….” This grammar rule defines x as at least one “y,” up to an infinite number of concatenated y’s. If a rule doesn’t have a path to a terminal, then it will infinitely recurse, which isn’t exactly useful for our purposes.

Structure

This project will begin with designing a BNF parser, which we will use to read a BNF grammar file that defines our language’s structure. This file will define everything from what a literal or variable is, how to tokenize the source files, to how we generate our parse/syntax tree. After this step, we need to build a parser definition for our language. We will hopefully convert all tokens from the source file into a set of stack-based instructions created by our Parser. Finally, the compiler will read off these instructions and generate our target language. In this project’s case, it will most likely compiler to either C, C++, or x86 ASM.

The Design

The concept of Tophat has been in the works since about September of 2018, where much of the planning was done with a friend, however we never got to any significant stage of code development. That does not mean, however, that the theory behind the fledgling language has not been worked on. A major niche that was noticed is that there’s a severe lack of mathematics-based domain specific languages which are also compiled. The vast majority are interpreted or compiled to a bytecode which is then interpreted (looking at you, MatLab). While this is excellent for languages with custom editors that require immediate feedback while editing a script, it’s suboptimal if a complex calculation or algorithm must be used. In that situation, you’d likely outsource to a more efficient, compiled language which isn’t specific to the domain of math. This is the core audience I wish to reach with Tophat.

The Paradigm

Tophat will have the option to be implicitly typed. Sometimes, you just don’t feel like fully specifying the type of a variable, and that’s okay. There will be a slight efficiency hit to support this feature, but the syntactic sugar will sweeten the deal enough.
Tophat will be procedural.
The syntax will be inspired by C, but absolutely not super similar.
Semicolon delimited.
Turing complete. We’ll prove this when the time comes.
Heavily optimize storage resources and runtime.
Provide plenty of built in functions for math.

Core Features

Relational Definitions

This feature is inspired in many ways by functional programming languages. Essentially, the issue this concept solves is the situation where you update a variable and wish to complete the same computation. For example, the formula for force is $f=ma$. If you were writing Java code involving this formula and needed to update the formula, you would write:

float f = m * a;
System.out.println(f);
a = a * 2; //For example
f = m * a;
System.out.println(f);

There are a good number of situations where this type of code can be very cumbersome, and particularly in the domain of math isn’t exactly how these relationships are meant to be interpreted. They’re mutable contingent on the idea the inputs are mutable. To prevent issues of infinite recursion, there will be two main safeguards:

Provide a compiler warning if a simple cycle is detected (read: in the same condition branch)
Provide a runtime check for cycle detection using Tarjan’s strongly connected components algorithm
Both of these safeguards will have compiler flags to disable them.

Hashed Array Tree (HAT)

Fun fact: This was the namesake for Tophat!
Despite its name, this data structure is neither hashed nor is it exactly an array. Almost all of the information my friend and I could find on HATs was either 404 links or very poorly documented from the late 1990’s. Essentially, a HAT is a space-optimized dynamic array. While a dynamic array typically wastes $O(2n)$ or $O(\Phi\cdot n)$ space for its scale factor, the HAT wastes at most $O(2\sqrt{n})$ space. The implementation is in essence a 2D array of size $\sqrt{n}\times\sqrt{n}$ where $n$ is a power of 2. When a leaf is filled with elements and another element is inserted, a new leaf will allocate. If the HAT is full, it will reallocate the pointer array to size $2\sqrt{n}$, and it will similarly downsize when the consumed capacity is at $\frac{n}{8}$. We’ll prove these concepts in a later post.

Arbitrary Precision Option

Yikes, arbitrary precision can get very nasty very quickly. It’s slow, doesn’t run well on hardware, and has a multitude of other issues. All of this, however, is a necessary price to pay if I truly need 200 digits of Pi or a prime number in the order of $2^{500}$. A necessary evil, especially for a math-based language. The plan is to use a structure which will select whether an AP number, 8 byte, 4 byte, 2 byte, or 1 byte precision must be used in runtime. Users will be able to override this function by specifying the exact type of number they wish to use.

Conclusion

This is going to be a very long and difficult project. As issues are encountered, it’s likely some of these plans will change. It’s important, however, that all changes are documented and the reasons why they were changed, which in a sense is another purpose of this blog. If you continue to follow this blog, I hope you enjoy!