Wednesday, 27 June 2018

Unicode Rebus Inequalities

Here's a quiz to celebrate the release of Unicode 11.0

What's the solution to the following "equation"?

🐕🦏🦅🛡🐌 = ?

If your browser's struggling with the Unicode characters, they should look something like this:


The answer is "DRESS" ... isn't it?


If we take the pictures on the left-hand side, we get "DOG", "RHINOCEROS", "EAGLE", "SHIELD" and "SNAIL". The initial letters spell "DRESS". Any school-child will tell you that. Easy, eh? Well, yes and no. Or should I say oui et non?

If I'm a French-speaker, I get a different result: "CHIEN", "RHINOCÉROS", "AIGLE", "BOUCLIER" and "ESCARGOT". This spells "CRABE" (crab):


By the way, there's no ambiguity in the names here; I'm using official Unicode international names for these code-points.

Perhaps we could reformulate "DRESS" using different code-points to try to get around this confusion:


That's "DIZZY", "ROOSTER", "EAR", "SLED" and "SNAIL" to spell "DRESS". The pictures are a bit more esoteric, but they still don't solve the confusion. In French, "ÉTOURDISSEMENT", "COQ", "OREILLE", "LUGE" and "ESCARGOT" spell out "ÉCOLE" (school):


These "Unicode rebus inequalities" don't just occur between English and French. Consider another English encoding for "DRESS" ("DRESS", "ROCKET", "EGG", "STATION" and "SCISSORS"):


In German, those pictures ("KLEID", "RAKETE", "EI", "BAHNHOF" and "SCHERE") spell out "KREBS" (crab again!):


This could form the basis of a fun (well, at least educational) game. The next step up would be to guess the language. Consider "HANDBAG" ("HAMMER", "AIRPLANE", "NEWSPAPER", "DNA", "BED", "ANT" and "GEAR") in English:


But which language produces the following, different right-hand side?


Answer: Spanish. The pictures ("MARTILLO", "AVIÓN", "PERIÓDICO", "ADN", "CAMA", "HORMIGA", "ENGRANAJE") spell out "MAPACHE" (raccoon).

The DNA and raccoon code-points are new in Unicode 11.0. That's why I'm using images for the equations.

Needless to say, there's a fair amount of data munging required to find these inequalities, but in all my searching, I've only found one non-trivial equality:


In English, "BOOKMARK", "RABBIT", "OGRE", "OCTOPUS" and "MOTORWAY" spell out "BROOM". But in Italian, remarkably, the same pictures "SEGNALIBRO", "CONIGLIO", "ORCO", "POLPO" and "AUTOSTRADA" spell out "SCOPA" which means ... [drum roll] ... "BROOM"!

Tuesday, 26 June 2018

egg Garbage Collection 1

I've always struggled with tracing garbage collection. Until last week, I assumed it's a bit of a black art, the internals of which are only understood by a small number of acolytes. In truth, you can go a long way without needing anything more than reference-counting. Indeed, Apple ditched their tracing garbage collector in 2015 in favour of an "automatic reference-counting (ARC)" solution. But memory management is a thorny issue for learners of any language or infrastructure, so I knew that tracing garbage collection was fairly inevitable for the egg run-time.

Background

Garbage collection is any mechanism whereby software knows when a resource is no longer accessible (is "dead", not "alive") and can therefore safely be freed up. For heap-based memory objects, "dead" instances are those that can no longer be reached via pointers.

There are three main forms of garbage collection:

Automatic Reference Counting

Each "live" instance contains a "reference count" of the number of pointers pointing to it. When a new reference is made to the instance, the count is incremented. When a reference is removed, the count is decremented. When the count is decremented to zero, there are, by definition, no pointers pointing at this instance: it can be declared "dead".

One potential problem with ARC is managing cycles. If instance "a" is pointing to instance "b" and instance "b" is pointing back to "a" (either directly or indirectly), the cycle keeps both instances alive (with reference counts of at least one) even if there are no other pointers pointing at either of them. Sometimes cycles can be broken using "weak references" but this is tricky and I won't explore that technique here.

Mark and Sweep

"Mark and sweep" and its variants are tracing garbage collectors that periodically trace which instances are unreachable and therefore suitable for "collection". In its simplest form, the garbage collector "stops the world" (i.e. prevents new instances being created and links between references from changing) and then performs the following:
  1. Create a list of all instances: "everything"
  2. Create an empty set: "marked"
  3. From list "everything", create another list of instances that are accessible from the outside world: "roots"
  4. For each "node" in "roots"
    • (MARK) If "node" is not already in "marked" then
      • Add "node" to "marked"
      • Repeat (MARK) recursively for each pointer in "node" to other nodes
  5. For each "dead" in "everything" that is not in "marked"
    • (SWEEP) Reclaim "dead"
This technique gets around cycles of instances, but it can be quite time-consuming (unpredictably so) and can use extra memory for housekeeping.

C++ Implementations

I've always thought that garbage collectors had to plumb themselves into C++ at quite a low-level; for instance, by redefining "operator new" or using special macros like the excellent Boehm-Demers-Weiser collector. However, I wanted the garbage collector for egg to be easy to switch out for an alternative in the future, so I looked for a less invasive/pervasive solution.

It turns out that garbage collection is just like any other piece of software: you can design it to be as decoupled or coupled as you like (with certain trade-offs), so I set about designing an infrastructure that tries to balance some desirable attributes:
  • Fast to execute
  • Easy to maintain
  • Simple to learn and use
  • Foolproof
  • Quick to implement

Baskets

The first solution I came up with is based around "baskets".
  • A "basket" is made up of zero or more "collectables"
  • Each "collectable" can belong to at most one "basket"
    • A "collectable" that does not belong to a "basket" is deemed an orphan
  • Each "collectable" may optionally be a root of its "basket"
    • All roots are deemed to be always reachable
  • Each "collectable" has zero or more "links" to other "collectables" in its "basket"
    • A "link" can be created, severed or amended
Actions that the programmer needs to perform include:
  • Creating a basket
  • Creating a collectable
  • Adding a collectable to a basket
  • Marking a collectable in a basket as being a root/non-root
  • Creating a link between two collectables in the same basket
  • Modifying a link to point are a difference collectable in the same basket
  • Severing a link
  • Collecting garbage
Note that there is no explicit way to remove a collectable from a basket. However, it may be useful for performance and testing to purge all the collectables from a basket as a single action.

The naive garbage collection algorithm is then:
  1. Stop the world (at least for this basket)
  2. Create a set of all collectables in the basket: "unvisited"
  3. For each "node" in this basket's list of roots
    • (MARK) If "node" is in "unvisited" then
      • Remove "node" from "unvisited"
      • Repeat (MARK) recursively for each link in "node" to other nodes
  4. For each "dead" in "unvisited"
    • (SWEEP) Reclaim "dead"
  5. Start the world
Notice that keeping track of the unvisited nodes, instead of the visited ones, simplifies the logic slightly.

In this scheme, baskets don't actually perform memory management operations on collectables. Collectables are allocated externally and then added to the basket. Similarly, the function to collect garbage returns a list of collectables that have been evicted from the basket. (In reality, it uses the visitor pattern to expose this list, thereby removing the need to physically construct it).

The Root of All Evil?

Attentive readers may have noticed that the concept of "root" is poorly defined. Also, if we want to extend this scheme to concurrent garbage collection (and who wouldn't?), how do we make sure we can set up the web of links between newly-added collectables before the collector sweeps them from under our noses?

Shock! Horror! There's more...

Tuesday, 19 June 2018

egg Pointers and Pointees

This week, I've been struggling with pointers in the egg language specification. First off, a bit of nomenclature: I'm going to use the term "pointer" to mean a reference to data in another location in general; I find the overloaded term "reference" too confusing.

There are any number of pitfalls with pointers, particularly C/C++-style raw pointers. As egg is meant to make programming easier for learners, it's useful to look at some of these issues.

Pointer Arithmetic

C/C++ allows arithmetic on raw pointers:

  // C/C++ code
  char buffer[10];
  char* p = buffer;
  p += 20;
  *p = '!' // Undefined behaviour

This is prone to error, so egg does not allow arbitrary pointer arithmetic. The only thing you can do with them is:
  1. Create pointers by taking the address of a pointee using "&";
  2. Assign one pointer to another;
  3. Pass pointers to functions; and
  4. De-reference a pointer to get hold of the pointee's value
In particular, you cannot increment "++" or decrement "--" a pointer.

Null Pointers

As mentioned in my last blog post, null pointers have been called a "billion-dollar mistake". Consequently, egg pointers cannot contain null.

  // egg code
  int* p = null; // Not allowed

However, there's nothing wrong with declaring the variable as taking a pointer or a null value:

  // egg code
  int*? p = null; // Allowed

Note the subtle difference between "int*?" and "int?*": the former can hold the value null or a pointer to an integer, whilst the latter can hold a (non-null) pointer to a value that is either null or an integer.

Wild Pointers

Wild pointers are raw pointers whose value has not been initialised:

  // C/C++ code
  char* p;
  *p = '!' // Undefined behaviour

The egg language can detect uninitialised pointers, so this shouldn't be a problem.

Dangling Pointers

Dangling pointers are raw pointers that used to point to valid pointees, but no longer do:

  // C++ code
  char* p = new char;
  delete p;
  *p = '!' // Undefined behaviour

The egg language uses garbage collection, so the "delete" (and/or "free") mechanism is redundant (and explicitly disallowed). However, there are other causes of dangling pointers:

  // C/C++ code
  char* p;
  {
    char buffer;
    p = &buffer;
  }
  *p = '!' // Undefined behaviour

The problem here is that "buffer" is allocated on the stack, but that stack memory is "reclaimed" at the end of the block.

The original Pascal solved this problem but not having an "address of" operator. That is, you could not take the address of a variable; you could only obtain a new pointer by allocating data on the heap using "new()". Later flavours of Pascal added the "@" prefix operator for taking the address of arbitrary variables.

In general, stack-allocated values have the potential to cause dangling pointers. See below.

Stack versus Heap

As this Rust document illustrates, the concepts of "stack" and "heap" are important for systems programming and computer science, but they can be confusing for learners. Some languages (like old-style C/C++) force you to be explicit but give very little protection against problems like dangling pointers. Newer languages (like Go and Rust) provide better security, but at the cost of a much steeper (and longer) learning curve.

One solution to the problem is to only allow either stack or heap allocations. Early FORTH implementations only allowed stack allocations: users found this severely limiting. Many scripting languages only allow heap allocations, but for general purpose languages, there are efficiency concerns with this strategy. One compromise, used by many languages, is to introduce the concept of boxing.

Boxing and Unboxing

Boxing is the act of  wrapping fundamental types so that they become "reference" types and so are allocated on the heap. In Rust 1.0, this is explicit:

  // Rust 1.0 code
  let p = Box::new(5); // p points to a heap-allocated integer 5

However, some languages "autobox" fundamental types, which can lead to surprising behaviour. In languages where complex types (sometimes known as "reference" types) are always allocated on the heap, boxing/unboxing is just an annoying edge case.

By-Reference Function Parameters

Most imperative languages have the ability to specify whether individual function parameters are passed by value or by reference. Pointers are one mechanism to achieve this:

  // C code
  void by_value(int x);
  void by_reference(int* x);

  int v = 10;
  by_value(v); // v is not changed
  by_reference(&v); // v may be changed

If using raw pointers, care must be taken concerning null pointers.

In many languages, complex types are passed by what Barbara Liskov et al named call by sharing:

  // JavaScript code
  function by_sharing(x) { ... }

  var v = { a: 1, b: 2 };
  by_sharing(v); // object pointed to by v may change

Therefore, true call-by-reference is often only required for fundamental types.

Pointers in egg

Conceptually, variables (including function parameters) in egg are stored as triples. For example:

  // egg code
  int? v = 10;

will be stored as:
  • Variable name: "v"
  • Variable type: "int|null"
  • Current value: 10
A variable's current value also has a type (in this case "int"); this is known as the run-time type of "v".

Variable triples are generally stored on a stack (one exception is a variable inside a co-routine, but more on that much later) and their values are variants (tagged unions). As variables are introduced, they are added to the top of the stack; as they go out of scope, they are dropped. The variable value variants can store a fairly large number of things (each fits into 64 bits, excluding the tag):
  1. A null value,
  2. Boolean values,
  3. Integer values,
  4. Floating-point values,
  5. References to strings (allocated on the heap),
  6. References to objects (allocated on the heap),
  7. Pointers,
  8. Indirects,
  9. etc.
In these variants, a pointer (Item 7) is a soft (i.e. garbage-collected) pointer to another variant allocated on the heap. An arbitrary level of indirection can be achieved by just chaining together pointers to other pointers.

Consider the following code:

  // egg code
  int v = 10;
  int* p = &v;

The first declaration pushes the following triple to the top of the variable stack:
  • Name="v", Type="int", Value=10
The second declaration has a problem. It tries to push a new triple:
  • Name="p", Type="int*", Value=???
  • Name="v", Type="int", Value=10
But what is the pointer value of "p"? Unfortunately, the value of "v" is not on the heap, as required; if we take its address, we're leaving ourselves open to the dangling pointer problem. Our solution is to patch the value of "v" to be an "Indirect" value (Item 8 above):
  • Name="p", Type="int*", Value=[heap314]
  • Name="v", Type="int", Value=Indirect → 10 at address [heap314]
We do this by copying the previous value of "v" to a new variant (at address [heap314] say) and then modifying the triple for "v" to point to this indirection. Now, the lifetime of the value of "v" is independent of the lifetime of the variable "v". This mitigates the dangling pointer problem, but we must be careful to silently follow "Indirect" references appropriately.

The current implementation patches values in this way if and when the first "address of " operator "&" is invoked on a variable. This means that heap allocations are not required for the vast majority of scalar values created during the execution of an egg program. In practice, it may be safer (in a thread-rich environment) to perform static analysis of the egg code to determine if a variable can ever have its address taken and, if so, create it with the indirection built in. This would eliminate potential race conditions in patching and accessing a variable across threads.

Limitations

This strategy works well with simple variables. There are at least two limitations that I can currently see.

First, this scheme does not extend to co-routine variables. The rules for the lifetimes of co-routine variables are already complicated: adding the ability to extend their lives by taking their address may be a step too far. I suspect storing all co-routine variables on the heap may be a good compromise.

Second, you cannot currently take the address of expressions other than variables.

  // Case 1: constants
  int* p1 = &10;

  // Case 2: arrays
  var a = [ 1, 2, 3 ];
  int* p2 = &(a[1]);

  // Case 3: objects
  var x = { y: 1 };
  int* p3 = &(x.y);

In Case 1, you might reasonably expect "p1" to be pointing at an integer 10 allocated on the heap. You can currently get around this by using a temporary variable, but it may be an edge case I'll revisit.

In Case 2, taking the address of an array element is not permitted. One argument against this ability is to consider what to do if the array is resized to fewer elements. This has re-introduced a dangling pointer.

Case 3 can be argued against in a similar fashion.

Tuesday, 12 June 2018

egg Null and Void

When I first came across the "void" keyword in the C programming language in the late eighties, I was very confused. What was the point of a type that couldn't hold any values? Judging by the questions on stackoverflow, it's still a problem for many learners of many languages. A related issue is the confusion over "null" and "undefined" in JavaScript. This particular thorn has been perpetuated in supposedly "cleaner" languages like TypeScript.

I've been struggling with this issue in the design of egg. But first some background...

Background

The introduction of the concept of "null" into higher-level computer languages by Tony Hoare has been called a "billion-dollar mistake". The idea is that values of the Null type (e.g. "nullptr_t" in C++) can only ever hold a single null value ("nullptr" in C++). In some languages, such at Java, object types (including "String") are implicitly sub-classes of the Null type. That is, you can always store a null value in a Java string variable. In C++, you can always assign "nullptr" to a pointer (e.g. "int*"), though never to references (e.g. "int&").

In contrast, if we ignore "compromised" type systems such as TypeScript, the "void" type can never hold any value. Thus, in C/C++, de-referencing of void pointers is not allowed.

This all fits nicely into an ontology of types:

  • Entities of type "void" can hold no values.
  • Entities of type "null" can hold only one value: "null"
  • Entities of type "bool" can hold only two values: "false" and "true"
  • etc, etc.

Plan A

The problem I'm struggling with in egg is when you use "void" and when to use "null". My first thought was to allow functions to return void (i.e. nothing) as an option as well as being able to return null or some other type:

  float|void root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    return; // Returns a "void" sentinel
  }
  var a = root(-2); // Run-time failure: cannot assign from void

But this can be simulated with exceptions:

  float root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    throw Exception("negative root");
  }
  var a = root(-2); // Run-time exception

or nulls:

  float? root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    return null;
  }
  var a = root(-2); // a is set to null

But the real motivation for using void returns is in generators:

  int|void ...counter() {
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  }
  for (var i : counter()) {
    print(i); // Print 1, 2, 3
  }

When control from the "counter" co-routine falls "off the end" of the function, it implicitly returns a void sentinel which terminates the for-each loop iterator. The same mechanism could be used in "guarded assignment" conditions:

  Person|void getSpouse(Person p) {
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
    }
  }
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope
  }

Given that one of the aims of egg is to reduce extraneous cognitive load, do we really need two concepts, "null" and "void", instead of just one? Is "computer science" making "learning programming" more complicated than necessary? This is the problem I've been struggling with.

Plan B

I don't think we can get rid of the concept of "void" for functions that do not return a value (a.k.a. methods). Fortunately [at the time of writing!], tuples are not supported in egg. Therefore, we could enforce that user-defined functions are declared to return either zero or one value. Generators terminate with a (hidden) sentinel and functions designed to work with guarded assignment conditions should return null.

This would make things look simpler in the examples above:

  int... counter() { // Removed "|void"
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  }
  for (var i : counter()) {
    print(i); // Print 1, 2, 3
  }

and

  Person? getSpouse(Person p) { // Replaced "|void" with "?"
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
    }
    return null; // Now needed
  }
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope
  }

This mechanism seems to clear up another (minor) issue I was worrying about: what is the difference between

  if (var wife = einstein.getSpouse()) ...

and

  if (var? wife = einstein.getSpouse()) ...

With the second plan, it's obvious that the latter form is not useful and I suspect that my gut feeling that "var?" is a useless construct anywhere is true.

Side Note

I came up with the notation "type1|type2" independently, but have since discovered that (at least) the Pike programming language uses the same syntax. Pike also only allows "void" to be used for function return types.
"One thing [the language designer] should not do is to include untried ideas of his own. His task is consolidation, not innovation." 
C. A. R. Hoare, "Hints on Programming Language Design" 1973