I started the egg project with an informal language specification in my head and a desire to minimise the number of dependencies on external libraries. Consequently, I developed the language syntax and the parser hand-in-hand. This was the first bad idea.
The parser creates an abstract syntax tree (AST). The ability to create an AST does not guarantee that the program is "well-formed", "correct" or anything else; you need to be able to run the program to verify that what you've got what you expected. So I implemented an interpreter that "ran" the AST and developed it alongside the syntax specification and parser. This was my second bad idea.
The problems didn't rear their heads until I started implementing generators. If you remember, these are co-routines that generate sequences of values. At present, there is no standard way of writing portable co-routines in C++. However, there is an experimental proposal for C++20. One alternative is to write a special form of the egg interpreter to execute generator functions (and only generator functions) in a "continuable" manner. Another alternative is to automatically re-write the generator functions as finite state machines. The issue with the first alternative is that it greatly increases the testing burden because you now have two interpreters that must perform consistently. The issue with the second alternative is that it's just plain hard!
But, anyway, I started down the road of the automatic generator function re-writer and quickly came up against two major obstacles:
- What was I converting from?
- What was I converting to?
Using ASTs to represent chunks of executable code is just daft; the level of abstraction is all wrong. What I needed was a virtual machine "intermediate representation". Sigh.
This was when I realised my first bad idea was to design the egg language syntax alongside the parser. The developer is just too tempted to use existing portions of the parser to implement new language features instead of taking a step back and saying "What would make most sense for a learner (or practitioner) of this language?"
My second bad idea was to interpret the AST directly and this turned out to be far more serious. The implementation of the execution subsystem pollutes that of the parser and vice versa. I intend to re-write the parser in egg (dogfooding) as soon as possible, and extricating the execution-only portion in C++ would have been a nightmare.
The egg virtual machine is called "ovum" and is extremely high-level. This is a deliberate design decision enabling us to cross-compile egg modules to other computer languages and still be readable by humans. In fact, you can "compile" an egg module, throw away the source, "de-compile" it to egg and get back pretty much what you fed in, but nicely formatted and without the comments. Ovum doesn't even dictate whether the underlying "machine" is stack-based, register-based, or whatever.
The external representation is a naturally-compact byte code. The internal presentation is a tree, not of syntax, but of opcodes. The opcodes describe the semantics of the program; there are far more of them than keywords in the egg language because context is everything. Therefore, a task such as scanning the program for variable declarations is much more simple that via the AST.
I took the execution re-write as an opportunity to refactor some troubling aspects of the previous implementation caused by my two bad ideas. However, this means that the parser and the compiler now use a different type system to the run-time! But, as I mentioned, I'm hoping to throw away the parser and compiler in the near future and have them running on the ovum VM.
The extensive test suites I've been maintaining, including dozens of example scripts, mean I'm fairly confident that everything is still working as expected. I'm back to the stage where I need to implement generator functions, and at least now it's a VM I'm working with and not an AST! But I did lose a few weeks.
Here's an old meme to cheer us all up:
Oh, the hue manatee! |
No comments:
Post a Comment