Wednesday, 19 September 2018

egg Virtual Machine 1

It's been nearly two months since my last post on egg. I did have a couple of breaks but the main delay was a substantial re-write due to a blind alley I stumbled down. It was my own fault...

I started the egg project with an informal language specification in my head and a desire to minimise the number of dependencies on external libraries. Consequently, I developed the language syntax and the parser hand-in-hand. This was the first bad idea.

The parser creates an abstract syntax tree (AST). The ability to create an AST does not guarantee that the program is "well-formed", "correct" or anything else; you need to be able to run the program to verify that what you've got what you expected. So I implemented an interpreter that "ran" the AST and developed it alongside the syntax specification and parser. This was my second bad idea.

The problems didn't rear their heads until I started implementing generators. If you remember, these are co-routines that generate sequences of values. At present, there is no standard way of writing portable co-routines in C++. However, there is an experimental proposal for C++20. One alternative is to write a special form of the egg interpreter to execute generator functions (and only generator functions) in a "continuable" manner. Another alternative is to automatically re-write the generator functions as finite state machines. The issue with the first alternative is that it greatly increases the testing burden because you now have two interpreters that must perform consistently. The issue with the second alternative is that it's just plain hard!

But, anyway, I started down the road of the automatic generator function re-writer and quickly came up against two major obstacles:
  1. What was I converting from?
  2. What was I converting to?
Using ASTs to represent chunks of executable code is just daft; the level of abstraction is all wrong. What I needed was a virtual machine "intermediate representation". Sigh.

This was when I realised my first bad idea was to design the egg language syntax alongside the parser. The developer is just too tempted to use existing portions of the parser to implement new language features instead of taking a step back and saying "What would make most sense for a learner (or practitioner) of this language?"

My second bad idea was to interpret the AST directly and this turned out to be far more serious. The implementation of the execution subsystem pollutes that of the parser and vice versa. I intend to re-write the parser in egg (dogfooding) as soon as possible, and extricating the execution-only portion in C++ would have been a nightmare.

The egg virtual machine is called "ovum" and is extremely high-level. This is a deliberate design decision enabling us to cross-compile egg modules to other computer languages and still be readable by humans. In fact, you can "compile" an egg module, throw away the source, "de-compile" it to egg and get back pretty much what you fed in, but nicely formatted and without the comments. Ovum doesn't even dictate whether the underlying "machine" is stack-based, register-based, or whatever.

The external representation is a naturally-compact byte code. The internal presentation is a tree, not of syntax, but of opcodes. The opcodes describe the semantics of the program; there are far more of them than keywords in the egg language because context is everything. Therefore, a task such as scanning the program for variable declarations is much more simple that via the AST.

I took the execution re-write as an opportunity to refactor some troubling aspects of the previous implementation caused by my two bad ideas. However, this means that the parser and the compiler now use a different type system to the run-time! But, as I mentioned, I'm hoping to throw away the parser and compiler in the near future and have them running on the ovum VM.

The extensive test suites I've been maintaining, including dozens of example scripts, mean I'm fairly confident that everything is still working as expected. I'm back to the stage where I need to implement generator functions, and at least now it's a VM I'm working with and not an AST! But I did lose a few weeks.

Here's an old meme to cheer us all up:

Oh, the hue manatee!

Saturday, 28 July 2018

egg Sequences 3

Sequences (as introduced earlier) are a good way to abstract lists of homogeneous data. We are only allowed to see one piece of data at a time and we don't know (until we get there) where the end of the data is. Indeed, sequences can be infinite.

This promotes the use of streaming algorithms which allow us to process very large data sets (i.e. ones that cannot all fit into memory at one time).

However, there is a hidden danger here:
Sequences may promote sequential processing of data
The hint's in the name!

Sequential processing on today's multi-core computers is rightly frowned upon. We should be writing code that is either explicitly parallel or gives the compiler and/or run-time the opportunity to parallelize. But sequential list processing, where you primarily only process the head of the list before moving on to the tail elements, goes all the way back to Lisp; over sixty years ago! It's no wonder these concepts are so deeply ingrained.

Sequences are often reduced via accumulation. As Guy Steel hints at in "Four Solutions to a Trivial Problem", accumulation can be difficult to automatically parallelize. The abandoned Fortress project was an attempt to mitigate this.

The egg language (not in the same league!) is designed to be easy to learn, but I would like some aspects of concurrency and parallelization to be accessible, even if only as a taste of more complex paradigms. Microsoft's PLINQ demonstrates that it is possible to build parallel execution on top of sequences, but I need to do a lot more research in this area. In particular, do I need to worry about this as part of the language specification as opposed to a library, like PLINQ?

Friday, 27 July 2018

egg Sequences 2

Last time we saw how various curly-brace programming languages deal with sequences, if at all. Now we'll concentrate on sequence generation.

Generators

In these examples, I'll use JavaScript because the syntax is more concise, but similar effects have be achieved with C#.

Consider this code:

  // JavaScript
  function* countdown() {
    yield 10;
    yield 9;
    yield 8;
    yield 7;
    yield 6;
    yield 5;
    yield 4;
    yield 3;
    yield 2;
    yield 1;
  }
  for (var i of countdown()) {
    console.log(i);
  }

This obviously counts down from 10 to 1. (I say "obviously" but that wasn't strictly true for me. I used "in" instead of "of" in the for loop and got nothing out. Good luck, learners!)

The first improvement we'd probably want to make is to use a loop:

  // JavaScript
  function* countdown() {
    for (var n = 10; n > 0; --n) {
      yield n;
    }
  }
  for (var i of countdown()) {
    console.log(i);
  }

Next, we could parameterize the countdown:

  // JavaScript
  function* countdown(x) {
    for (var n = x; n > 0; --n) {
      yield n;
    }
  }
  for (var i of countdown(10)) {
    console.log(i);
  }

Nice. But what exactly is "countdown" and what does it return? If you bring up a Chrome console and type in the last snippet, you can get useful insights:


So, "countdown" is a function; no surprise there! But a call to "countdown(10)" produces a suspended "Generator" object with a couple of scopes. This gives an insight into what is happening under the hood.

A JavaScript generator function returns a object that implements the iterator protocol. This consists of the "next()" member function, which, in turn, allows the use of "for ... of" loops and spread syntax:

  > console.log(...countdown(10))
  10 9 8 7 6 5 4 3 2 1

Yield

The JavaScript yield operator differs from the return statement in a couple of fundamental ways.

Firstly, "yield" is resumable. If a function terminates with a "return" statement (including an implicit return at the end), that's it: game over; the function has finished. With a yield, the function may resume at a later date. This implies that the state of the function (e.g. local variables, exception try blocks, etc.) must be preserved when a "yield" is executed so that the generator can continue exactly where it left off. This is a coroutine.

Secondly, a JavaScript "yield" is an operator that produces a value. It is an expression. This allows the caller to pass information back into the iterator function. A "return" is a statement.

Coroutines

If generator functions produce coroutines, what sort of coroutines do they produce?

The various flavours of coroutines are discussed at length in the following papers:

Symmetric versus Asymmetric

Symmetric coroutines can yield to any arbitrary coroutine; asymmetric coroutines can only yield to their "caller". It has been shown that you can build symmetric coroutines from asymmetric ones, and vice versa. Asymmetric coroutines are generally considered to be easier for humans to understand; for one thing, they maintain the notion of caller and callee. Both JavaScript and C# implement asymmetric coroutines.

Stackful versus Non-stackful

Stackful coroutines can yield anywhere, including from functions called by coroutine. Non-stackful coroutines can only can only suspend their execution when the control stack is at the same level that it was at creation time. Non-stackful coroutines have the advantage that yields are unambiguous in cases where one co-routine has explicitly called another. Both JavaScript and C# implement non-stackful coroutines.

Internal-only versus Externalised

This only applies to coroutines that are generators. Internal-only yields can only be used inside foreach-like statements; for example, CLU iterators are internal-only. Externalised yields provide an interface for calling the coroutine at arbitrary call-sites. Both JavaScript and C# implement externalised yields (via the "iterator protocol" and "IEnumerator interface" respectively).

Unidirectional versus Bidirectional

This is my own nomenclature. Unidirectional coroutines use yield statements; i.e. no value can be passed from the caller back to the callee upon resumption. Bidirectional coroutines use yield expressions. In reality, bidirectional coroutines can be approximated using unidirectional ones:

  // JavaScript: bidirectional
  function* generator() {
    // blah blah
    var y = yield x; // caller returns 'y'
    // blah blah
  }

  // JavaScript: unidirectional
  function* generator() {
    // blah blah
    var xy = { x: x };
    yield xy; // caller adds 'xy.y'
    var y = xy.y;
    // blah blah
  }

JavaScript implements bidirectional coroutines; C# implements unidirectional coroutines.

Implementation

Under the hood, calls to JavaScript generator functions create objects that encapsulate the logic of the body of the function and maintain the state. We could do this by hand:

  // JavaScript
  function countdown(x) {
    var n = x;
    return {
       next: function() {
         return (n > 0) ? {value: n--, done: false}
                        : {done: true};
       }
    };
  }

But this is tedious and error-prone, not to mention ugly. Another option is to convert the generator body to resumable code automatically. This usually entails producing a finite state machine that replicates the functionality. There are at least two tools that do this:
  1. Google's Traceur, and
  2. Facebook's Regenerator.
Both produce genuinely scary code.

Perhaps the best solution is to support resumable function bodies natively. On code designed to run on a virtual machine, this is not so difficult. However, for efficiency, generators will probably be run via a different (stackless) code path from standard function calls. This is what the current egg implementation does, and I suspect this is also true for Chrome's generators too. C# on the other hand rewrites the function.

Tuesday, 24 July 2018

egg Sequences 1

Sequences of items are a recurring theme in programming, but different computer languages have differing levels of support for them. For the purposes of these posts, I'm going to define a sequence as:
  1. Zero or more homogeneous elements
  2. Elements may be repeated
  3. Elements have a position (or positions, in the case of repeated elements) within the sequence
  4. The only operation allowed for a consumer is to fetch the next element; this may fail if we have reached the end of the sequence
  5. Sequences may be endless
This definition is known by many names in different languages (lists, enumerations, streams, iterations, etc.) but I'll stick with "sequence" to avoid ambiguity.

Let's break down the definition. Point 1 sounds quite restrictive: what if we want a sequence of integers and strings? This isn't really a problem in a language with composable types; we just define the type of the elements in our sequence to be the union of all the expected, individual types. In egg, we could use the hold-all "any" (or "any?" if we want nulls too).

Point 4 hints that, from a programming point of view, the consumer of a sequence doesn't see very much. But what does it need to see?

Support in Existing Languages

Rust has a very good interface for what it calls iterators. It has a single member function that returns the next element or nothing:

    // Rust
    fn next(&mut self) -> Option;

JavaScript has a similar protocol. The single member returns an object with two fields ("value", which may be undefined, and "done", which is a Boolean):

    // JavaScript
    function next();

Other languages are, in my opinion, a bit inferior in this respect. C# defines an enumerator interface with three members:

    // C#
    public T Current { get; }
    public bool MoveNext();
    public void Reset();

My two concerns with C#'s "IEnumerator<T>" are:
  • Getting the next element is not an atomic operation: you need to move to the next element and then read it.
  • The "Reset" method suggests that sequences are always "replayable". This is rarely the case (see the comment in the source).
In Java 8, sequences are named "iterators". The interface has four members:

    // Java
    default void forEachRemaining(Consumer action)
    boolean hasNext();
    T next();
    default void remove();

The first is a helper that can be overridden to optimise sequence operations. The next two exhibit the same atomicity problem that C#'s interface has. The final "remove" method is just plain strange. Needless to say, the default implementation throws a "not supported" exception.

C++ iterators are heavily templated and do not, in general, work via interfaces (though see Boost for some work towards this). Iterators in C++ are often bi-directional.

Atomicity

As I mentioned above, C# and Java have potential atomicity problems. In the vast majority of cases, these never appear or can be avoided by using additional locks. However, sequences can be a useful building block in concurrent systems (e.g. CSP), so native atomicity is a desirable feature.

For example, imagine a sequence of tasks being used as a queue of work for two or more consumers. If the consumers cannot atomically fetch the next work item from the sequence, there is the potential for work items to be accidentally skipped or even performed twice.

For that reason, if written in C++17, one interface for sequences could be:

    // C++
    template<typename T> class Sequence {
      virtual ~Sequence() {}
      virtual std::optional<T> next() = 0;
    };

This is effectively the Rust interface above.

Sequence Functions in egg

In egg, we can simplify this interface (for integers) to a single function signature:

    // egg
    void|int next()

We could have used the following:

    // egg
    int? next()

But that would mean that sequences could not contain null values.

The type of a sequence function for integers is therefore "(void|int)()" which is somewhat ugly. I originally thought that abbreviating this to "int..." would be quite intuitive, but quickly ran into syntax problems (see later). At present, I am toying with the idea of using "int!" as a shorthand, based on some fuzzy notion that other languages using "!" when dealing with channels.

Thus, if a function took two integer sequences and merged them in some way, the signature could be:

    // egg
    int! merge(int! left, int! right)

This implies that sequences are first-class entities: what R. P. James, and A. Sabry call "externalised yields" that can be used outside of the "for-each" control statement.

However, this only deals with consumers of sequences. How does one produce a sequence in the first place?

Sequence Generators

One way to create a sequence is via a container that permits iteration. Most languages that support the "for-each" control statement allow you to iterate around elements in an array, for example.

    // JavaScript
    var a = [10, 20, 30];
    for (var v of a) {
      process(v);
    }

However, one of the strengths of sequences is that the elements can be generated on-demand or lazily. Indeed, some sequences are infinite, so populating a container with all the elements of a sequence may be impractical or impossible.

Both C# and JavaScript have the "yield" keyword. C, C++ and Java have non-standard mechanisms for achieving similar effects, but they are not part of the language.

Here's a generator function to produce the infinite, zero-based, Fibonacci sequence in JavaScript:

    function* fibonacci() {
      var a = 0;
      var b = 1;
      for (;;) {  
        yield a;
        var c = a + b;
        a = b;
        b = c;
      }
    }

And in C#:

    public IEnumerable<int> fibonacci() {
      int a = 0;
      int b = 1;
      for (;;) {
        yield return a;
        int c = a + b;
        a = b;
        b = c;
      }
    }

The syntax is slightly different, but the fundamentals appear similar. Surely that means that generator functions are trivial to implement? Alas, no.

Next time, I'll discuss why generators are so complicated under the surface. In the mean-time, might I again suggest this paper as some bed-time reading?

Friday, 6 July 2018

egg Garbage Collection 2

In the first part of this thread, I introduced a less intrusive tracing garbage collector that I'm working on for the egg programming language. One of the questions I left hanging was determining which nodes in the basket are "roots". That is, which nodes are pointed to directly from outside the basket?

Roots and Concurrency

At some stage in the future, I'd like to switch the garbage collection to be concurrent. This poses additional problems. Imagine you are inserting a parent node and a child node into the basket and the parent is a root. How do you do this whilst the garbage collector is concurrently running? You have to be very careful of the order of operation so that the collector doesn't accidentally evict your nodes before you've had the chance to finalise all the links.

One solution is to ensure all the new nodes are roots and then downgrade most of them after linking them together. This implies that downgrading nodes, from root to non-root, needs to be an efficient operation.

Another potential requirement is "locking" nodes when a subsystem other than the garbage collector is using them; you don't want the collector to pull the rug from under their feet.

Hard and Soft References

To try to solve these issues, I've been experimenting with "soft" and "hard" references. These shouldn't be confused with "weak" references; that's an orthogonal concern. A soft reference is a link between two nodes in the same basket; these are the links that the tracing garbage collector follows. A hard reference is a link to a node that uses traditional reference counting. Unlike soft references, hard references do not need to know which node they are pointing from, only where they are pointing to. Indeed, hard references do not even need to be to or from nodes within a basket. In these respects, hard references are similar to "std::shared_ptr" in C++.

In egg, the "Collectable" base class can be the target of both soft and hard references. To be the target of a soft reference, the "Collectable" node must be a member of exactly one basket. When the node was added to it's basket, the basket obtained a single hard reference to it. This single hard reference is maintained no matter how many soft references there are to it from within the basket; it is only relinquished when the node is evicted from the basket.

Here's an overview:
  • When a node is added to a basket, the basket takes a hard reference to it.
  • Nodes can have zero or more additional hard references to them.
  • The garbage collector only considers nodes for eviction if they have a hard reference count equal to one, i.e. the only hard reference to them is from the basket.
  • Nodes that have hard references in addition to the single reference from the basket are considered "roots" and are effectively locked.
  • Nodes can only be added to a basket by supplying a hard reference. This overcomes the race condition causing premature eviction of partially constructed networks.
  • Nodes can be made a "root" simply by creating an external hard reference to it. When the last external hard reference is lost, the node is no longer a "root" and is a candidate for eviction from the basket if it is not accessible from some other root.

Practical Considerations

When I started implementing soft references, I found they are incredibly difficult to construct properly. The reason is that soft references need to know both the source and target of the reference. If you are constructing an instance, "source", of a class that contains a soft reference to another node, "target", you can easily end up creating a reference between "source" and the partially constructed "target". If an exception is thrown within the constructor (or a function called from it) it can be very difficult to untangle the links safely.

The solution I'm using at the moment creates all the links as hard references and then demotes them to soft references later on. This has the added advantage of not adding nodes to the basket at all in the event of an exception being thrown part-way through the initial creation of the network.

Another facility I added to make constructing soft references less error-prone is "basket inference". Usually, the sequence of events is:
  1. Add the target to the basket, if it's not already added
  2. Add the source to the basket, if it's not already added
  3. Create a soft reference from the source to the target
Instead, the basket (which must be the same for both source and target) is inferred where possible; the source or target are added to the basket as necessary. This usually works because it's highly unlikely that neither the source nor the target have been added to a basket already.

Thursday, 5 July 2018

egg Boolean operators

Here's an interview-style question:
What are the Boolean operators in C++?
Of course, it's a trick question; there are two flavours of "Boolean operator":

  • Operators that take Boolean operands, and/or
  • Operators that return Boolean values

So let's be more concrete. Which standard C++ operators can you substitute for "◊" to make the following well-formed?

    // C++
    bool a = false;
    bool b = true;
    auto c = a ◊ b;

It turns out that any C++ operator that can be applied to integers can be substituted, including shift operators like "<<". Because of the history of the introduction of "bool" values into the language, they often get type-promoted to integers silently. For example, "c" has type "int" in the following:

    // C++
    bool a = false;
    bool b = true;
    auto c = a << b;

Whilst the following holds true:

    // C++
    assert((false - true) == -1);

The list of C++ operators that take Boolean operands and return Boolean values is much smaller:

  • a == b
  • a != b
  • a < b
  • a <= b
  • a >= b
  • a > b
  • a && b
  • a || b
  • !a

In Java, Boolean values seem to be treated more carefully. It adds the "&", "|" and "^" operators.

Let's not even think about JavaScript!

I've been extending my test coverage of egg scripts (using OpenCppCoverage) and have come across this inconsistency. As part of the effort to make egg easy to learn by reducing surprise, I'm shying away from automatic type promotion (although I suspect integer-to-float promotions will always be warranted). Therefore, I decided to add explicit Java-style Boolean operators to egg.

    // egg
    bool a = foo();
    bool b = bar();
    var c1 = a & b; // 'c1' is bool
    var c2 = a | b; // 'c2' is bool
    var c3 = a ^ b; // 'c3' is bool

And, while I'm at, I'll add the following operators for orthogonality:

    // egg
    bool a = foo();
    bool b = bar();
    a &&= b; // Same as 'if (a) { a = b; }'
    a ||= b; // Same as 'if (!a) { a = b; }'
    int? c = baz();
    c ??= 123; // Same as 'if (c == null) { c = 123; }'

The last operator is particularly useful when dealing with default parameters.

And then we have the rabbit-hole of the missing "^^" and "^^=" operators...

Postscript

I also noticed today that spaceships are set to invade C++20. I will resist such an invasion of egg!

Wednesday, 27 June 2018

Unicode Rebus Inequalities

Here's a quiz to celebrate the release of Unicode 11.0

What's the solution to the following "equation"?

🐕🦏🦅🛡🐌 = ?

If your browser's struggling with the Unicode characters, they should look something like this:


The answer is "DRESS" ... isn't it?


If we take the pictures on the left-hand side, we get "DOG", "RHINOCEROS", "EAGLE", "SHIELD" and "SNAIL". The initial letters spell "DRESS". Any school-child will tell you that. Easy, eh? Well, yes and no. Or should I say oui et non?

If I'm a French-speaker, I get a different result: "CHIEN", "RHINOCÉROS", "AIGLE", "BOUCLIER" and "ESCARGOT". This spells "CRABE" (crab):


By the way, there's no ambiguity in the names here; I'm using official Unicode international names for these code-points.

Perhaps we could reformulate "DRESS" using different code-points to try to get around this confusion:


That's "DIZZY", "ROOSTER", "EAR", "SLED" and "SNAIL" to spell "DRESS". The pictures are a bit more esoteric, but they still don't solve the confusion. In French, "ÉTOURDISSEMENT", "COQ", "OREILLE", "LUGE" and "ESCARGOT" spell out "ÉCOLE" (school):


These "Unicode rebus inequalities" don't just occur between English and French. Consider another English encoding for "DRESS" ("DRESS", "ROCKET", "EGG", "STATION" and "SCISSORS"):


In German, those pictures ("KLEID", "RAKETE", "EI", "BAHNHOF" and "SCHERE") spell out "KREBS" (crab again!):


This could form the basis of a fun (well, at least educational) game. The next step up would be to guess the language. Consider "HANDBAG" ("HAMMER", "AIRPLANE", "NEWSPAPER", "DNA", "BED", "ANT" and "GEAR") in English:


But which language produces the following, different right-hand side?


Answer: Spanish. The pictures ("MARTILLO", "AVIÓN", "PERIÓDICO", "ADN", "CAMA", "HORMIGA", "ENGRANAJE") spell out "MAPACHE" (raccoon).

The DNA and raccoon code-points are new in Unicode 11.0. That's why I'm using images for the equations.

Needless to say, there's a fair amount of data munging required to find these inequalities, but in all my searching, I've only found one non-trivial equality:


In English, "BOOKMARK", "RABBIT", "OGRE", "OCTOPUS" and "MOTORWAY" spell out "BROOM". But in Italian, remarkably, the same pictures "SEGNALIBRO", "CONIGLIO", "ORCO", "POLPO" and "AUTOSTRADA" spell out "SCOPA" which means ... [drum roll] ... "BROOM"!

Tuesday, 26 June 2018

egg Garbage Collection 1

I've always struggled with tracing garbage collection. Until last week, I assumed it's a bit of a black art, the internals of which are only understood by a small number of acolytes. In truth, you can go a long way without needing anything more than reference-counting. Indeed, Apple ditched their tracing garbage collector in 2015 in favour of an "automatic reference-counting (ARC)" solution. But memory management is a thorny issue for learners of any language or infrastructure, so I knew that tracing garbage collection was fairly inevitable for the egg run-time.

Background

Garbage collection is any mechanism whereby software knows when a resource is no longer accessible (is "dead", not "alive") and can therefore safely be freed up. For heap-based memory objects, "dead" instances are those that can no longer be reached via pointers.

There are three main forms of garbage collection:

Automatic Reference Counting

Each "live" instance contains a "reference count" of the number of pointers pointing to it. When a new reference is made to the instance, the count is incremented. When a reference is removed, the count is decremented. When the count is decremented to zero, there are, by definition, no pointers pointing at this instance: it can be declared "dead".

One potential problem with ARC is managing cycles. If instance "a" is pointing to instance "b" and instance "b" is pointing back to "a" (either directly or indirectly), the cycle keeps both instances alive (with reference counts of at least one) even if there are no other pointers pointing at either of them. Sometimes cycles can be broken using "weak references" but this is tricky and I won't explore that technique here.

Mark and Sweep

"Mark and sweep" and its variants are tracing garbage collectors that periodically trace which instances are unreachable and therefore suitable for "collection". In its simplest form, the garbage collector "stops the world" (i.e. prevents new instances being created and links between references from changing) and then performs the following:
  1. Create a list of all instances: "everything"
  2. Create an empty set: "marked"
  3. From list "everything", create another list of instances that are accessible from the outside world: "roots"
  4. For each "node" in "roots"
    • (MARK) If "node" is not already in "marked" then
      • Add "node" to "marked"
      • Repeat (MARK) recursively for each pointer in "node" to other nodes
  5. For each "dead" in "everything" that is not in "marked"
    • (SWEEP) Reclaim "dead"
This technique gets around cycles of instances, but it can be quite time-consuming (unpredictably so) and can use extra memory for housekeeping.

C++ Implementations

I've always thought that garbage collectors had to plumb themselves into C++ at quite a low-level; for instance, by redefining "operator new" or using special macros like the excellent Boehm-Demers-Weiser collector. However, I wanted the garbage collector for egg to be easy to switch out for an alternative in the future, so I looked for a less invasive/pervasive solution.

It turns out that garbage collection is just like any other piece of software: you can design it to be as decoupled or coupled as you like (with certain trade-offs), so I set about designing an infrastructure that tries to balance some desirable attributes:
  • Fast to execute
  • Easy to maintain
  • Simple to learn and use
  • Foolproof
  • Quick to implement

Baskets

The first solution I came up with is based around "baskets".
  • A "basket" is made up of zero or more "collectables"
  • Each "collectable" can belong to at most one "basket"
    • A "collectable" that does not belong to a "basket" is deemed an orphan
  • Each "collectable" may optionally be a root of its "basket"
    • All roots are deemed to be always reachable
  • Each "collectable" has zero or more "links" to other "collectables" in its "basket"
    • A "link" can be created, severed or amended
Actions that the programmer needs to perform include:
  • Creating a basket
  • Creating a collectable
  • Adding a collectable to a basket
  • Marking a collectable in a basket as being a root/non-root
  • Creating a link between two collectables in the same basket
  • Modifying a link to point are a difference collectable in the same basket
  • Severing a link
  • Collecting garbage
Note that there is no explicit way to remove a collectable from a basket. However, it may be useful for performance and testing to purge all the collectables from a basket as a single action.

The naive garbage collection algorithm is then:
  1. Stop the world (at least for this basket)
  2. Create a set of all collectables in the basket: "unvisited"
  3. For each "node" in this basket's list of roots
    • (MARK) If "node" is in "unvisited" then
      • Remove "node" from "unvisited"
      • Repeat (MARK) recursively for each link in "node" to other nodes
  4. For each "dead" in "unvisited"
    • (SWEEP) Reclaim "dead"
  5. Start the world
Notice that keeping track of the unvisited nodes, instead of the visited ones, simplifies the logic slightly.

In this scheme, baskets don't actually perform memory management operations on collectables. Collectables are allocated externally and then added to the basket. Similarly, the function to collect garbage returns a list of collectables that have been evicted from the basket. (In reality, it uses the visitor pattern to expose this list, thereby removing the need to physically construct it).

The Root of All Evil?

Attentive readers may have noticed that the concept of "root" is poorly defined. Also, if we want to extend this scheme to concurrent garbage collection (and who wouldn't?), how do we make sure we can set up the web of links between newly-added collectables before the collector sweeps them from under our noses?

Shock! Horror! There's more...

Tuesday, 19 June 2018

egg Pointers and Pointees

This week, I've been struggling with pointers in the egg language specification. First off, a bit of nomenclature: I'm going to use the term "pointer" to mean a reference to data in another location in general; I find the overloaded term "reference" too confusing.

There are any number of pitfalls with pointers, particularly C/C++-style raw pointers. As egg is meant to make programming easier for learners, it's useful to look at some of these issues.

Pointer Arithmetic

C/C++ allows arithmetic on raw pointers:

  // C/C++ code
  char buffer[10];
  char* p = buffer;
  p += 20;
  *p = '!' // Undefined behaviour

This is prone to error, so egg does not allow arbitrary pointer arithmetic. The only thing you can do with them is:
  1. Create pointers by taking the address of a pointee using "&";
  2. Assign one pointer to another;
  3. Pass pointers to functions; and
  4. De-reference a pointer to get hold of the pointee's value
In particular, you cannot increment "++" or decrement "--" a pointer.

Null Pointers

As mentioned in my last blog post, null pointers have been called a "billion-dollar mistake". Consequently, egg pointers cannot contain null.

  // egg code
  int* p = null; // Not allowed

However, there's nothing wrong with declaring the variable as taking a pointer or a null value:

  // egg code
  int*? p = null; // Allowed

Note the subtle difference between "int*?" and "int?*": the former can hold the value null or a pointer to an integer, whilst the latter can hold a (non-null) pointer to a value that is either null or an integer.

Wild Pointers

Wild pointers are raw pointers whose value has not been initialised:

  // C/C++ code
  char* p;
  *p = '!' // Undefined behaviour

The egg language can detect uninitialised pointers, so this shouldn't be a problem.

Dangling Pointers

Dangling pointers are raw pointers that used to point to valid pointees, but no longer do:

  // C++ code
  char* p = new char;
  delete p;
  *p = '!' // Undefined behaviour

The egg language uses garbage collection, so the "delete" (and/or "free") mechanism is redundant (and explicitly disallowed). However, there are other causes of dangling pointers:

  // C/C++ code
  char* p;
  {
    char buffer;
    p = &buffer;
  }
  *p = '!' // Undefined behaviour

The problem here is that "buffer" is allocated on the stack, but that stack memory is "reclaimed" at the end of the block.

The original Pascal solved this problem but not having an "address of" operator. That is, you could not take the address of a variable; you could only obtain a new pointer by allocating data on the heap using "new()". Later flavours of Pascal added the "@" prefix operator for taking the address of arbitrary variables.

In general, stack-allocated values have the potential to cause dangling pointers. See below.

Stack versus Heap

As this Rust document illustrates, the concepts of "stack" and "heap" are important for systems programming and computer science, but they can be confusing for learners. Some languages (like old-style C/C++) force you to be explicit but give very little protection against problems like dangling pointers. Newer languages (like Go and Rust) provide better security, but at the cost of a much steeper (and longer) learning curve.

One solution to the problem is to only allow either stack or heap allocations. Early FORTH implementations only allowed stack allocations: users found this severely limiting. Many scripting languages only allow heap allocations, but for general purpose languages, there are efficiency concerns with this strategy. One compromise, used by many languages, is to introduce the concept of boxing.

Boxing and Unboxing

Boxing is the act of  wrapping fundamental types so that they become "reference" types and so are allocated on the heap. In Rust 1.0, this is explicit:

  // Rust 1.0 code
  let p = Box::new(5); // p points to a heap-allocated integer 5

However, some languages "autobox" fundamental types, which can lead to surprising behaviour. In languages where complex types (sometimes known as "reference" types) are always allocated on the heap, boxing/unboxing is just an annoying edge case.

By-Reference Function Parameters

Most imperative languages have the ability to specify whether individual function parameters are passed by value or by reference. Pointers are one mechanism to achieve this:

  // C code
  void by_value(int x);
  void by_reference(int* x);

  int v = 10;
  by_value(v); // v is not changed
  by_reference(&v); // v may be changed

If using raw pointers, care must be taken concerning null pointers.

In many languages, complex types are passed by what Barbara Liskov et al named call by sharing:

  // JavaScript code
  function by_sharing(x) { ... }

  var v = { a: 1, b: 2 };
  by_sharing(v); // object pointed to by v may change

Therefore, true call-by-reference is often only required for fundamental types.

Pointers in egg

Conceptually, variables (including function parameters) in egg are stored as triples. For example:

  // egg code
  int? v = 10;

will be stored as:
  • Variable name: "v"
  • Variable type: "int|null"
  • Current value: 10
A variable's current value also has a type (in this case "int"); this is known as the run-time type of "v".

Variable triples are generally stored on a stack (one exception is a variable inside a co-routine, but more on that much later) and their values are variants (tagged unions). As variables are introduced, they are added to the top of the stack; as they go out of scope, they are dropped. The variable value variants can store a fairly large number of things (each fits into 64 bits, excluding the tag):
  1. A null value,
  2. Boolean values,
  3. Integer values,
  4. Floating-point values,
  5. References to strings (allocated on the heap),
  6. References to objects (allocated on the heap),
  7. Pointers,
  8. Indirects,
  9. etc.
In these variants, a pointer (Item 7) is a soft (i.e. garbage-collected) pointer to another variant allocated on the heap. An arbitrary level of indirection can be achieved by just chaining together pointers to other pointers.

Consider the following code:

  // egg code
  int v = 10;
  int* p = &v;

The first declaration pushes the following triple to the top of the variable stack:
  • Name="v", Type="int", Value=10
The second declaration has a problem. It tries to push a new triple:
  • Name="p", Type="int*", Value=???
  • Name="v", Type="int", Value=10
But what is the pointer value of "p"? Unfortunately, the value of "v" is not on the heap, as required; if we take its address, we're leaving ourselves open to the dangling pointer problem. Our solution is to patch the value of "v" to be an "Indirect" value (Item 8 above):
  • Name="p", Type="int*", Value=[heap314]
  • Name="v", Type="int", Value=Indirect → 10 at address [heap314]
We do this by copying the previous value of "v" to a new variant (at address [heap314] say) and then modifying the triple for "v" to point to this indirection. Now, the lifetime of the value of "v" is independent of the lifetime of the variable "v". This mitigates the dangling pointer problem, but we must be careful to silently follow "Indirect" references appropriately.

The current implementation patches values in this way if and when the first "address of " operator "&" is invoked on a variable. This means that heap allocations are not required for the vast majority of scalar values created during the execution of an egg program. In practice, it may be safer (in a thread-rich environment) to perform static analysis of the egg code to determine if a variable can ever have its address taken and, if so, create it with the indirection built in. This would eliminate potential race conditions in patching and accessing a variable across threads.

Limitations

This strategy works well with simple variables. There are at least two limitations that I can currently see.

First, this scheme does not extend to co-routine variables. The rules for the lifetimes of co-routine variables are already complicated: adding the ability to extend their lives by taking their address may be a step too far. I suspect storing all co-routine variables on the heap may be a good compromise.

Second, you cannot currently take the address of expressions other than variables.

  // Case 1: constants
  int* p1 = &10;

  // Case 2: arrays
  var a = [ 1, 2, 3 ];
  int* p2 = &(a[1]);

  // Case 3: objects
  var x = { y: 1 };
  int* p3 = &(x.y);

In Case 1, you might reasonably expect "p1" to be pointing at an integer 10 allocated on the heap. You can currently get around this by using a temporary variable, but it may be an edge case I'll revisit.

In Case 2, taking the address of an array element is not permitted. One argument against this ability is to consider what to do if the array is resized to fewer elements. This has re-introduced a dangling pointer.

Case 3 can be argued against in a similar fashion.

Tuesday, 12 June 2018

egg Null and Void

When I first came across the "void" keyword in the C programming language in the late eighties, I was very confused. What was the point of a type that couldn't hold any values? Judging by the questions on stackoverflow, it's still a problem for many learners of many languages. A related issue is the confusion over "null" and "undefined" in JavaScript. This particular thorn has been perpetuated in supposedly "cleaner" languages like TypeScript.

I've been struggling with this issue in the design of egg. But first some background...

Background

The introduction of the concept of "null" into higher-level computer languages by Tony Hoare has been called a "billion-dollar mistake". The idea is that values of the Null type (e.g. "nullptr_t" in C++) can only ever hold a single null value ("nullptr" in C++). In some languages, such at Java, object types (including "String") are implicitly sub-classes of the Null type. That is, you can always store a null value in a Java string variable. In C++, you can always assign "nullptr" to a pointer (e.g. "int*"), though never to references (e.g. "int&").

In contrast, if we ignore "compromised" type systems such as TypeScript, the "void" type can never hold any value. Thus, in C/C++, de-referencing of void pointers is not allowed.

This all fits nicely into an ontology of types:

  • Entities of type "void" can hold no values.
  • Entities of type "null" can hold only one value: "null"
  • Entities of type "bool" can hold only two values: "false" and "true"
  • etc, etc.

Plan A

The problem I'm struggling with in egg is when you use "void" and when to use "null". My first thought was to allow functions to return void (i.e. nothing) as an option as well as being able to return null or some other type:

  float|void root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    return; // Returns a "void" sentinel
  }
  var a = root(-2); // Run-time failure: cannot assign from void

But this can be simulated with exceptions:

  float root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    throw Exception("negative root");
  }
  var a = root(-2); // Run-time exception

or nulls:

  float? root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    return null;
  }
  var a = root(-2); // a is set to null

But the real motivation for using void returns is in generators:

  int|void ...counter() {
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  }
  for (var i : counter()) {
    print(i); // Print 1, 2, 3
  }

When control from the "counter" co-routine falls "off the end" of the function, it implicitly returns a void sentinel which terminates the for-each loop iterator. The same mechanism could be used in "guarded assignment" conditions:

  Person|void getSpouse(Person p) {
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
    }
  }
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope
  }

Given that one of the aims of egg is to reduce extraneous cognitive load, do we really need two concepts, "null" and "void", instead of just one? Is "computer science" making "learning programming" more complicated than necessary? This is the problem I've been struggling with.

Plan B

I don't think we can get rid of the concept of "void" for functions that do not return a value (a.k.a. methods). Fortunately [at the time of writing!], tuples are not supported in egg. Therefore, we could enforce that user-defined functions are declared to return either zero or one value. Generators terminate with a (hidden) sentinel and functions designed to work with guarded assignment conditions should return null.

This would make things look simpler in the examples above:

  int... counter() { // Removed "|void"
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  }
  for (var i : counter()) {
    print(i); // Print 1, 2, 3
  }

and

  Person? getSpouse(Person p) { // Replaced "|void" with "?"
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
    }
    return null; // Now needed
  }
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope
  }

This mechanism seems to clear up another (minor) issue I was worrying about: what is the difference between

  if (var wife = einstein.getSpouse()) ...

and

  if (var? wife = einstein.getSpouse()) ...

With the second plan, it's obvious that the latter form is not useful and I suspect that my gut feeling that "var?" is a useless construct anywhere is true.

Side Note

I came up with the notation "type1|type2" independently, but have since discovered that (at least) the Pike programming language uses the same syntax. Pike also only allows "void" to be used for function return types.
"One thing [the language designer] should not do is to include untried ideas of his own. His task is consolidation, not innovation." 
C. A. R. Hoare, "Hints on Programming Language Design" 1973