This is a continuation of my last post (same disclaimers, even more relevant this time!).
As features for this new version, I wanted static types, flow control, and generally a better architecture to make it more easily extendable. Previously dubbed "The Second", this was not a very accurate name. While v2.0 was a complete rebuild, I hadn't fully figured out what the new architecture needed to look like. I did make some significant improvements in some parts, but others looked just like a refactoring of the old code. The main areas of improvement were types and tokenization. The latter allowed for fancy exceptions, which made debugging much easier. In my first post, I described how I removed empty lines within brackets. That didn't alter the semantics of the code, but it changed its shape, meaning line and column numbers were off. To combat this, the new tokenization is simply no longer dependent on the EOL character.
Some tests and errors
1)
vec3 a = [1,2,3]
vec3 b = [4,5,6]
a = a + b + c
ExpressionParserTyped.Core.Exceptions.BASE_EXCEPTION:
+----------------------------------------------------------------
| SyntaxError in line 2.1:
| vec3 b = [4,5,6]
| ^^^^
| | Expected ;
+----------------------------------------------------------------
2)
vec3 a = [1,2,3];
vec3 b = [4,5,6];
a = a + b + c;
ExpressionParserTyped.Core.Exceptions.BASE_EXCEPTION:
+----------------------------------------------------------------
| MemoryError in line 3.13:
| a = a + b + c;
| ^
| | Could not find 'c' in memory
+----------------------------------------------------------------
3)
vec3 a = [1,2,3];
vec3 b = [4,5,6];
vec2 c = [0,0];
a = a + b + c;
ExpressionParserTyped.Core.Exceptions.BASE_EXCEPTION:
+----------------------------------------------------------------
| MathError in line 4.13:
| a = a + b + c;
| ^
| | Operation not defined MXVector3 + MXVector2
+----------------------------------------------------------------
While I added casting for some types, it seems I forgot some. All this shininess is great! But it distracts from the fact that two out of three requirements were not fulfilled. Sure, it could do some basic math with integers, floats, vectors, and matrices; I could now write functions in the code itself; and it had some pretty errors. But it still had no flow control, and at this point also no integration with Maya. And even though I built this version in a more extendable way, this was only true for binary atomic operations, not for entirely new expressions (or even unary operators). But that's exactly what I needed for statements like conditions or loops. So back to the drawing board...
Writing a toy language... or something?
My requirements were still the same, but I had to do something fundamentally different. At that point, it occurred to me that the features I intended to implement were not necessarily part of an expression parser (after all, there is a difference between expressions and statements), but rather something you'd expect from a simple programming language. Something one will inevitably come across at this point is LLVM. But given that I didn't really know what I had to do, switching to a new tool was unlikely to solve that problem. I might revisit this later, we'll see.
In the previous post, I briefly mentioned grammar. In Don's implementation, you can find a comment describing his grammar at the top of the file. I needed a function that could parse the code based on a given grammar (turns out, that's how Python works).
Instead of using a third-party tool, I wanted to see if I could come up with a solution myself. After spending lots of time aimlessly searching the internet, reading about regex and grammar, and playing around with a modified regex clone hoping to get recursion to work for my purposes, I had a better idea. Since I didn't need something super complex, I could go with a much simpler tokenization than I had before. In v1.5, I implemented function definitions. Each function was just a single token encapsulating everything inside. So why not do the same here? I needed expressions, assignments, if-statements, and loops for now. I also reworked the expression token to include unary pre- and post-operators.
# binary, unary prefix and unary postfix operators
OP_BIN = f"(?:{'|'.join(Operators.get_binary_escaped())})"
OP_PRE = f"(?:{'|'.join(Operators.get_prefix_escaped())})"
OP_POST = f"(?:{'|'.join(Operators.get_postfix_escaped())})"
# atom as we know it
ATOM = rf"(?:{ATTR}|{VARIABLE}|{MATRIX}|{VECTOR}|{NUMERIC})"
# an expression can now be anything from an atom to several binary
# and unary operations on as many atoms as we'd like
_ATOM_OP = rf"\(*?(?:{OP_PRE})?\(*?{ATOM}\)*?(?:{OP_POST})?\)*?"
EXPRESSION = f"(?:{_ATOM_OP} *(?:{OP_BIN} *{_ATOM_OP})*)"
# assignments
_DECLARE = f"(?:{VAR_TYPES} *{VARIABLE} *;)"
_VAR_SET = f"(?:{VAR} *= *{EXPRESSION} *;)"
_ATTR_SET = f"(?:{ATTR} *= *{EXPRESSION} *;)"
_DECLARE_SET = f"(?:(const *)?{VAR_TYPES} *{VAR} *= *{EXPRESSION} *;)"
ASSIGNMENT = f"(?:{_DECLARE_SET}|{_DECLARE}|{_ATTR_SET}|{_VAR_SET})"
# if statements
# with the if, it's important to check for if-else before checking
# a plain if, as any if-else would also match an if
IF = rf"(?:if *\( *{EXPRESSION} *\) *{{.*}}(?: *else *{{.*}})?;)"
# while loops
WHILE = rf"(?:while *\( *{EXPRESSION} *\) *{{.*}};)"
# Skipping the for loop as there is no way I can fit this in here...^^
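To give a feel for how these patterns compose, here is a minimal sketch with simplified stand-ins for the building blocks (the real VARIABLE/NUMERIC/ATTR patterns and operator lists are defined elsewhere, so the names below are placeholders):

```python
import re

# simplified, hypothetical stand-ins for the real sub-patterns
NUMERIC = r"(?:\d+(?:\.\d+)?)"
VARIABLE = r"(?:[a-zA-Z_]\w*)"
ATOM = rf"(?:{VARIABLE}|{NUMERIC})"
OP_BIN = r"(?:\+|-|\*|/)"
_ATOM_OP = rf"\(*?{ATOM}\)*?"
EXPRESSION = rf"(?:{_ATOM_OP} *(?:{OP_BIN} *{_ATOM_OP})*)"
ASSIGNMENT = rf"(?:{VARIABLE} *= *{EXPRESSION} *;)"

print(bool(re.fullmatch(ASSIGNMENT, "x = a + 2;")))  # True
print(bool(re.fullmatch(ASSIGNMENT, "x = a + 2")))   # False, missing the ;
```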
Anyone paying close attention to the regex might notice there are no newline characters to be found anywhere. That is on purpose. Relying on stripping newlines from brackets is a step backward, but for now, the simple tokenization meant I had to make some compromises. This of course comes with its own challenges. Let's take this code as an example:
float x = 3;
float y = 0;
float z = 0;
while(x > 0){
x = x - 1;
y = y + 1;
};
while(y > 0)
{
y = y - 1;
z = z + 1;
};
After preprocessing the newlines in the brackets, it looks like this:
float x = 3;
float y = 0;
float z = 0;
while(x > 0){ x = x - 1; y = y + 1;}
while(y > 0)
{ y = y - 1; z = z + 1;}
The first four lines will fully match against a pattern, but the fifth would produce an error. To solve this, whenever a line doesn't match, I simply keep appending the next line until it does. This works great unless the first line is a valid match on its own but the second line was meant to go with it, like here:
if(x > 10) { y = 10;}
else { y = 0;}
This can be solved by checking the lines in reverse order until there is a valid expression. But this method is not bulletproof. For now, it's enough to take that new expression, collapse the brackets, and split at the line-separation characters. Though this has served me well so far, I'd like to revisit this entire system eventually and get the approach from v1.5 working here.
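A minimal sketch of that merging fallback might look like this (a hypothetical helper; the real tokenizer matches against the full pattern set above):

```python
import re

def merge_lines(lines, patterns):
    """If a line doesn't fully match any pattern on its own,
    keep appending the following lines until it does."""
    tokens, buffer = [], ""
    for line in lines:
        buffer = f"{buffer} {line}".strip() if buffer else line
        if any(re.fullmatch(p, buffer) for p in patterns):
            tokens.append(buffer)
            buffer = ""
    if buffer:
        raise SyntaxError(f"could not tokenize: {buffer!r}")
    return tokens

# the multi-line while loop gets merged into a single token
merge_lines(
    ["float x = 3;", "while(x > 0)", "{ x = x - 1;};"],
    [r"float \w+ = \d+;", r"while\(.*\) *\{.*\};"],
)
# -> ['float x = 3;', 'while(x > 0) { x = x - 1;};']
```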
With all this in place, it was much easier to build an actual AST. As with the attempts before, I implemented the math portion first. With a more general structure in place, I was inclined to use the expression parsing approach from the first version, but that was easier said than done, as it did not account for any unary operators other than the negative sign. Writing this portion actually turned out to be much less of a hassle than I expected! In simple terms, I am parsing through the expression, splitting at each operator in reverse precedence order, recursively. Out of that, I get two lists: one containing the left and right sides, and one the operators. As some operators serve a double purpose as unary and binary operators, I am running a cleanup pass that finds the empty strings in the list and stitches together what was broken apart:
# -5-x*(3!)+2
>>> ["", "5", "x*(3!)", "2"], ["-", "-", "+"]
# after cleanup
>>> ["-5", "x*(3!)", "2"], ["-", "+"]
Next, call the function again on each element on the left side but this time, look at the operator with the next higher precedence.
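Ignoring the bracket protection the real version needs, the split-and-cleanup pass could be sketched like this (hypothetical helper name):

```python
import re

def split_expression(expr, ops):
    """Split at the given operators, then stitch unary uses back together."""
    pattern = f"({'|'.join(re.escape(o) for o in ops)})"
    parts = re.split(pattern, expr)
    operands, operators = parts[0::2], parts[1::2]
    # an empty operand means the operator was unary, e.g. the
    # leading "-" in "-5", so glue it onto the next operand
    while "" in operands:
        i = operands.index("")
        operands[i + 1] = operators[i] + operands[i + 1]
        del operands[i]
        del operators[i]
    return operands, operators

split_expression("-5-x*(3!)+2", ["+", "-"])
# -> (['-5', 'x*(3!)', '2'], ['-', '+'])
```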
There is of course a bit more going on to ensure brackets are not broken apart, but this is the essential part. Using this, I created the first node for my AST. The nodes don't have to match the tokens, but they can. E.g., I could have added one node for variables, one for constant values, one for unary operations, one for binary operations, etc. I decided not to do any of those. Instead, I created one node type for each token: NodeExpression, NodeAssignment, NodeIf, NodeFor, NodeWhile. This meant that NodeExpression got a bit more complicated, and I would probably do this differently next time around. For now, it works well. And as complex as NodeExpression is, the other nodes are just as simple. The AST for this particular expression looks like this:
Symbol: + (type:float, const:False)
left: Symbol: - (type:float, const:False)
left: -5 (type:float, const:True)
right: Symbol: * (type:float, const:False)
left: x (type:float, const:False)
right: Symbol: ! (type:float, const:True)
value: 3 (type:float, const:True)
right: 2 (type:float, const:True)
Each value, such as x, 2, 3, and -5, is printed directly, together with its type and whether it is constant (I will get back to the const below). Non-value nodes print their operator, type, and constness, followed by their operand(s) on the next line(s).
Next up, I implemented the remaining nodes. Building them was much easier, as I could use my utility functions to find matching brackets and extract the contained expressions and assignments. From there, everything was just recursive calls.
As a quick recap: so far, the tokenizer and the AST generation have been redone; what's missing is the interpretation portion... I didn't like the way that worked previously. While I got it to work, it was not good. As there was no real AST, the Expression.* functions all returned custom types from MathType. These had more logic in them than they should have, likely due to me sticking to Don's approach for too long. As mentioned, his purpose was different from mine, and while the approach was appropriate in his case, it wasn't in mine. This time, I wanted something simpler.
My wrangle node was always in the back of my head during all of this. Nothing specific, but I figured it might be worth writing something that I could reuse at a later point, so ideally this would not be too intertwined with the regex, types, or parsing code so far. In my regex/grammar experiments above, I came across postfix notation. Their example was simple: "a+b" becomes "a b +". This also works for math: "take 4", "take 5", and "add what you have". Based on that, what I came up with was a list of very simple instructions that another program could go through one by one and execute. I called this my CPU, as it would do the actual calculations. I learned afterward that this is basically what a virtual machine (VM) is. It also took me a few hours to notice, but at least from the outside, the list of instructions together with the stack for the memory looked an awful lot like the few bits and pieces I remembered from my high school comp-sci class about assembly. There are of course some key differences between what I was building and real assembly for hardware. Please keep that in mind, as I will refer to x86 assembly from time to time. There are also a few instances in which looking up assembly earlier would have been beneficial.
The idea was for the AST to generate a list of instructions that could then be executed by the VM. As I understand, this seems to be very roughly how Python works as well.
To start, I added a handful of instructions to my CPU:
ADD ; add the top two values on the stack as floats
SUB ; subtract the last value from the second-last value as floats
MUL ; multiply the top two values on the stack as floats
DIV ; divide the second-last value by the last value as floats
PUSH $ ; put a constant value on the stack
POP ; pop the last element from the stack
LOAD $ ; load a variable from memory
STORE $ ; pop the last element from the stack and store it in memory
The implementation for this is very simple as well:
(my terminology for memory and stack is probably completely off from anything happening in an actual processor. Emulating that was not my goal)
import math

class VirtualCPU:
    def __init__(self, instructions):
        self._stack = []
        self._memory = {}
        for cmd in instructions:
            # push a constant value
            if cmd.startswith("PUSH "):
                self.cmd_push(cmd[5:])
            # load a variable
            elif cmd.startswith("LOAD "):
                self.cmd_push(str(self._memory[cmd[5:]]))
            # store a variable
            elif cmd.startswith("STORE "):
                self._memory[cmd[6:]] = self.cmd_pop()
            # run any other command
            else:
                getattr(self, f"cmd_{cmd.lower()}")()

    def cmd_push(self, value):
        self._stack.append(str(value))

    def cmd_pop(self):
        return float(self._stack.pop())

    def cmd_add(self):
        self.cmd_push(str(self.cmd_pop() + self.cmd_pop()))

    def cmd_sub(self):
        second = self.cmd_pop()
        first = self.cmd_pop()
        self.cmd_push(str(first - second))

    def cmd_mul(self):
        self.cmd_push(str(self.cmd_pop() * self.cmd_pop()))

    def cmd_div(self):
        second = self.cmd_pop()
        first = self.cmd_pop()
        self.cmd_push(str(first / second))

    def cmd_factorial(self):
        # math.factorial needs an integer, so convert the popped float
        self.cmd_push(str(math.factorial(int(self.cmd_pop()))))
A real CPU does not have the concept of data types. Everything is just 0s and 1s. Giving meaning to these values happens at a higher-level stage. For example, the binary value 01000001 could be interpreted as the integer 65, or as A as defined in the ASCII table. In the C language, 'A'+10 is a valid expression and gives 75 when assigned to a variable of type integer, or K when assigned to a variable of type char. On the hardware level and in memory, it's still just 0s and 1s.
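The same reinterpretation can be demonstrated from Python:

```python
value = 0b01000001      # the raw bits from the example
print(value)            # interpreted as an integer: 65
print(chr(value))       # interpreted via the ASCII table: A
print(chr(value + 10))  # the C expression 'A' + 10, read as a char: K
```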
In a similar way, in the VirtualCPU, every value and instruction is a string, regardless of its type. This is not strictly necessary, as an instruction could be a tuple consisting of the command and the value. To me though, the consistency and the ability to easily serialize the generated "assembly" code into a text file for sending off to another VirtualCPU was worth more. The VirtualCPU doesn't do bitwise operations like a real one would at its core; instead, each instruction converts the string into the appropriate representation to execute the operation with, and converts the result back into a string.
For convenience, I added a FACTORIAL instruction as well. So looking at the example from above, the instructions are:
PUSH -5
LOAD x
PUSH 3
FACTORIAL
MUL
SUB
PUSH 2
ADD
Great! To run this, the memory needs an x variable of course, but that's easy. After that, the VirtualCPU can already run these instructions. For x=2, the last remaining element in the stack is -15.0. If you want to try this yourself, you can also put STORE x as another instruction. The stack will then be empty but your memory will contain x=-15.0.
The next step from here was vector and matrix instructions. While I could have implemented this using the existing instruction set to some extent already, it would have been too tedious and I am not (that) insane :)
After that was complete, I had a pretty fancy calculator. But, I still didn't have any way of controlling the flow of my program!! It just runs through from start to finish. I did remember JMP (jump), JZ (jump-if-zero), and JNZ (jump-if-not-zero) from school, but I didn't know how this was implemented. It turns out processors have a little bit of fast memory available to them called registers. One of those registers is the program counter (PC) or (extended) instruction pointer (EIP). Adding that to the VirtualCPU is straightforward:
self._register["PC"] = 0
while self._register["PC"] < len(instructions):
    # fetch the instruction and update the program counter
    cmd = instructions[self._register["PC"]]
    self._register["PC"] += 1
Implementing JMP $ instructions is now as easy as updating the PC register. In my case, I am not setting any flag registers based on previous operations like a real CPU does; instead, conditional jumps look at the last value on the stack.
while(x<5){
x=x+1;
}
turns into
LOAD x ; this part forms the condition (x<5) <------+
PUSH 5 ; |
LT ; <- this also puts the result on the stack |
JZ 6 ; if false, exit -+ |
LOAD x ; | |
PUSH 1 ; | |
ADD ; | |
STORE x ; | |
JMP -8 ; loop ---------- | ---------------------------+
; <----------+
The loop will run for as long as x is less than 5. If the condition is false (=0), the code jumps out of the loop. If not, it runs through the loop and ends by jumping back to the condition. "for"-loops work in the same way with the exception that they run their init statement once first and their update statement before jumping into the next loop.
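The loop above can be executed with a self-contained miniature of the fetch loop. This is my sketch, not the full VirtualCPU, and I'm assuming jump offsets are relative to the jumping instruction itself, consistent with the diagram:

```python
def run(instructions, memory=None):
    """Tiny VM sketch: floats on the stack, relative JMP/JZ offsets."""
    stack, memory, pc = [], dict(memory or {}), 0
    while pc < len(instructions):
        op, _, arg = instructions[pc].partition(" ")
        pc += 1  # fetch, then advance the program counter
        if op == "PUSH":
            stack.append(float(arg))
        elif op == "LOAD":
            stack.append(memory[arg])
        elif op == "STORE":
            memory[arg] = stack.pop()
        elif op == "ADD":
            stack.append(stack.pop() + stack.pop())
        elif op == "LT":
            second, first = stack.pop(), stack.pop()
            stack.append(1.0 if first < second else 0.0)
        elif op == "JMP":
            pc += int(arg) - 1  # -1 because pc already advanced
        elif op == "JZ" and stack.pop() == 0:
            pc += int(arg) - 1
    return memory

# the while(x<5){x=x+1;} program from above
loop = ["LOAD x", "PUSH 5", "LT", "JZ 6",
        "LOAD x", "PUSH 1", "ADD", "STORE x", "JMP -8"]
run(loop, {"x": 0.0})  # -> {'x': 5.0}
```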
As a last topic for this post, I briefly want to come back to the meaning of the const values/variables. In a lot of programming languages, variables can be declared as constants. This allows for some optimizations at compile time. Given the example from above, if x were a constant, there would be no need for the VirtualCPU to run through all the instructions every time. Instead, it could calculate the result once at "compile time" and then just push that result. While x is not constant here, I spent my flights to and from SIGGRAPH trying to remember some high school math and came up with a way of shuffling AST nodes around that would not alter the outcome, but produce more constant branches that could be simplified.
Coming back to the example from above, let's look at the top three nodes:
-5 - x*(3!) + 2
---------------------
c1: -5
v1: x*(3!)
c2: 2
Currently it's:
((c1 - v1) + c2)
and no subtree contains only constant terms that could be precomputed.
But it can be rewritten into:
((c1 + c2) - v1)
Now (c1 + c2) is a constant node and can be precalculated:
(c3 - v1)
Instead of two calculations at runtime, the CPU now only needs to perform one.
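A toy version of that rewrite could look like this. This is a hypothetical miniature, not the real pass: numbers stand in for constant nodes, strings for variable terms, and only the add/sub case from above is handled:

```python
def is_const(node):
    return isinstance(node, (int, float))

def simplify(node):
    """Fold constants; apply ((c1 - v1) + c2) -> ((c1 + c2) - v1)."""
    if not isinstance(node, tuple):
        return node
    op, left, right = node[0], simplify(node[1]), simplify(node[2])
    # a fully constant node can be computed right away
    if is_const(left) and is_const(right):
        return left + right if op == "+" else left - right
    # ((c1 - v1) + c2)  ->  ((c1 + c2) - v1)
    if (op == "+" and isinstance(left, tuple) and left[0] == "-"
            and is_const(left[1]) and is_const(right)):
        return simplify(("-", left[1] + right, left[2]))
    return (op, left, right)

# -5 - x*(3!) + 2, with x*(3!) treated as one opaque variable term
simplify(("+", ("-", -5, "v1"), 2))  # -> ('-', -3, 'v1')
```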
I managed to find optimizations for a few of these cases. There are two main ones I dealt with. Both with exactly two constant values. Anything with fewer can't get optimized and anything with more will already be optimized when it reaches this step.
((const op var) op const))
((const op var) op (const op var))
Given that there are 4 configurations for both as well as 4 different operators, I believe this gives 64 permutations for the first and 256 permutations for the second. For the former, I was able to optimize 32. The dealbreaker here is mul/div operators between one const and the variable combined with add/sub for the other const (like (5*/x)+-3). For the latter, I only managed to get 32 as well. I don't think I can optimize terms that mix add/sub with mul/div operations, such as (3*x)+(2*y), (3+x)*(2+y), or (3+x)+(2*y).
Notice that the (3!) is also a constant and therefore got simplified as well. The optimization runs through all the nodes in the AST until there are no more optimizations to be found.
ast = ast.simplify()
Symbol: - (type:float, const:False)
left: -3.0 (type:float, const:True)
right: Symbol: * (type:float, const:False)
left: x (type:float, const:False)
right: 6 (type:float, const:True)
With the simplified instructions:
PUSH -3.0
LOAD x
PUSH 6
MUL
SUB
Using constants also allows for pruning of if-branches or loops in certain situations.
Next up are functions/procedures for the VirtualCPU as well as implementing the MPU (Maya-Processing-Unit :D). Stay tuned!
Cheers!