Tuesday, 19 June 2018

egg Pointers and Pointees

This week, I've been struggling with pointers in the egg language specification. First off, a bit of nomenclature: I'm going to use the term "pointer" to mean a reference to data in another location in general; I find the overloaded term "reference" too confusing.

There are any number of pitfalls with pointers, particularly C/C++-style raw pointers. As egg is meant to make programming easier for learners, it's useful to look at some of these issues.

Pointer Arithmetic

C/C++ allows arithmetic on raw pointers:

  // C/C++ code
  char buffer[10];
  char* p = buffer;
  p += 20;
  *p = '!' // Undefined behaviour

This is prone to error, so egg does not allow arbitrary pointer arithmetic. The only thing you can do with them is:
  1. Create pointers by taking the address of a pointee using "&";
  2. Assign one pointer to another;
  3. Pass pointers to functions; and
  4. De-reference a pointer to get hold of the pointee's value
In particular, you cannot increment "++" or decrement "--" a pointer.

Null Pointers

As mentioned in my last blog post, null pointers have been called a "billion-dollar mistake". Consequently, egg pointers cannot contain null.

  // egg code
  int* p = null; // Not allowed

However, there's nothing wrong with declaring the variable as taking a pointer or a null value:

  // egg code
  int*? p = null; // Allowed

Note the subtle difference between "int*?" and "int?*": the former can hold the value null or a pointer to an integer, whilst the latter can hold a (non-null) pointer to a value that is either null or an integer.

Wild Pointers

Wild pointers are raw pointers whose value has not been initialised:

  // C/C++ code
  char* p;
  *p = '!' // Undefined behaviour

The egg language can detect uninitialised pointers, so this shouldn't be a problem.

Dangling Pointers

Dangling pointers are raw pointers that used to point to valid pointees, but no longer do:

  // C++ code
  char* p = new char;
  delete p;
  *p = '!' // Undefined behaviour

The egg language uses garbage collection, so the "delete" (and/or "free") mechanism is redundant (and explicitly disallowed). However, there are other causes of dangling pointers:

  // C/C++ code
  char* p;
    char buffer;
    p = &buffer;
  *p = '!' // Undefined behaviour

The problem here is that "buffer" is allocated on the stack, but that stack memory is "reclaimed" at the end of the block.

The original Pascal solved this problem but not having an "address of" operator. That is, you could not take the address of a variable; you could only obtain a new pointer by allocating data on the heap using "new()". Later flavours of Pascal added the "@" prefix operator for taking the address of arbitrary variables.

In general, stack-allocated values have the potential to cause dangling pointers. See below.

Stack versus Heap

As this Rust document illustrates, the concepts of "stack" and "heap" are important for systems programming and computer science, but they can be confusing for learners. Some languages (like old-style C/C++) force you to be explicit but give very little protection against problems like dangling pointers. Newer languages (like Go and Rust) provide better security, but at the cost of a much steeper (and longer) learning curve.

One solution to the problem is to only allow either stack or heap allocations. Early FORTH implementations only allowed stack allocations: users found this severely limiting. Many scripting languages only allow heap allocations, but for general purpose languages, there are efficiency concerns with this strategy. One compromise, used by many languages, is to introduce the concept of boxing.

Boxing and Unboxing

Boxing is the act of  wrapping fundamental types so that they become "reference" types and so are allocated on the heap. In Rust 1.0, this is explicit:

  // Rust 1.0 code
  let p = Box::new(5); // p points to a heap-allocated integer 5

However, some languages "autobox" fundamental types, which can lead to surprising behaviour. In languages where complex types (sometimes known as "reference" types) are always allocated on the heap, boxing/unboxing is just an annoying edge case.

By-Reference Function Parameters

Most imperative languages have the ability to specify whether individual function parameters are passed by value or by reference. Pointers are one mechanism to achieve this:

  // C code
  void by_value(int x);
  void by_reference(int* x);

  int v = 10;
  by_value(v); // v is not changed
  by_reference(&v); // v may be changed

If using raw pointers, care must be taken concerning null pointers.

In many languages, complex types are passed by what Barbara Liskov et al named call by sharing:

  // JavaScript code
  function by_sharing(x) { ... }

  var v = { a: 1, b: 2 };
  by_sharing(v); // object pointed to by v may change

Therefore, true call-by-reference is often only required for fundamental types.

Pointers in egg

Conceptually, variables (including function parameters) in egg are stored as triples. For example:

  // egg code
  int? v = 10;

will be stored as:
  • Variable name: "v"
  • Variable type: "int|null"
  • Current value: 10
A variable's current value also has a type (in this case "int"); this is known as the run-time type of "v".

Variable triples are generally stored on a stack (one exception is a variable inside a co-routine, but more on that much later) and their values are variants (tagged unions). As variables are introduced, they are added to the top of the stack; as they go out of scope, they are dropped. The variable value variants can store a fairly large number of things (each fits into 64 bits, excluding the tag):
  1. A null value,
  2. Boolean values,
  3. Integer values,
  4. Floating-point values,
  5. References to strings (allocated on the heap),
  6. References to objects (allocated on the heap),
  7. Pointers,
  8. Indirects,
  9. etc.
In these variants, a pointer (Item 7) is a soft (i.e. garbage-collected) pointer to another variant allocated on the heap. An arbitrary level of indirection can be achieved by just chaining together pointers to other pointers.

Consider the following code:

  // egg code
  int v = 10;
  int* p = &v;

The first declaration pushes the following triple to the top of the variable stack:
  • Name="v", Type="int", Value=10
The second declaration has a problem. It tries to push a new triple:
  • Name="p", Type="int*", Value=???
  • Name="v", Type="int", Value=10
But what is the pointer value of "p"? Unfortunately, the value of "v" is not on the heap, as required; if we take its address, we're leaving ourselves open to the dangling pointer problem. Our solution is to patch the value of "v" to be an "Indirect" value (Item 8 above):
  • Name="p", Type="int*", Value=[heap314]
  • Name="v", Type="int", Value=Indirect → 10 at address [heap314]
We do this by copying the previous value of "v" to a new variant (at address [heap314] say) and then modifying the triple for "v" to point to this indirection. Now, the lifetime of the value of "v" is independent of the lifetime of the variable "v". This mitigates the dangling pointer problem, but we must be careful to silently follow "Indirect" references appropriately.

The current implementation patches values in this way if and when the first "address of " operator "&" is invoked on a variable. This means that heap allocations are not required for the vast majority of scalar values created during the execution of an egg program. In practice, it may be safer (in a thread-rich environment) to perform static analysis of the egg code to determine if a variable can ever have its address taken and, if so, create it with the indirection built in. This would eliminate potential race conditions in patching and accessing a variable across threads.


This strategy works well with simple variables. There are at least two limitations that I can currently see.

First, this scheme does not extend to co-routine variables. The rules for the lifetimes of co-routine variables are already complicated: adding the ability to extend their lives by taking their address may be a step too far. I suspect storing all co-routine variables on the heap may be a good compromise.

Second, you cannot currently take the address of expressions other than variables.

  // Case 1: constants
  int* p1 = &10;

  // Case 2: arrays
  var a = [ 1, 2, 3 ];
  int* p2 = &(a[1]);

  // Case 3: objects
  var x = { y: 1 };
  int* p3 = &(x.y);

In Case 1, you might reasonably expect "p1" to be pointing at an integer 10 allocated on the heap. You can currently get around this by using a temporary variable, but it may be an edge case I'll revisit.

In Case 2, taking the address of an array element is not permitted. One argument against this ability is to consider what to do if the array is resized to fewer elements. This has re-introduced a dangling pointer.

Case 3 can be argued against in a similar fashion.

Tuesday, 12 June 2018

egg Null and Void

When I first came across the "void" keyword in the C programming language in the late eighties, I was very confused. What was the point of a type that couldn't hold any values? Judging by the questions on stackoverflow, it's still a problem for many learners of many languages. A related issue is the confusion over "null" and "undefined" in JavaScript. This particular thorn has been perpetuated in supposedly "cleaner" languages like TypeScript.

I've been struggling with this issue in the design of egg. But first some background...


The introduction of the concept of "null" into higher-level computer languages by Tony Hoare has been called a "billion-dollar mistake". The idea is that values of the Null type (e.g. "nullptr_t" in C++) can only ever hold a single null value ("nullptr" in C++). In some languages, such at Java, object types (including "String") are implicitly sub-classes of the Null type. That is, you can always store a null value in a Java string variable. In C++, you can always assign "nullptr" to a pointer (e.g. "int*"), though never to references (e.g. "int&").

In contrast, if we ignore "compromised" type systems such as TypeScript, the "void" type can never hold any value. Thus, in C/C++, de-referencing of void pointers is not allowed.

This all fits nicely into an ontology of types:

  • Entities of type "void" can hold no values.
  • Entities of type "null" can hold only one value: "null"
  • Entities of type "bool" can hold only two values: "false" and "true"
  • etc, etc.

Plan A

The problem I'm struggling with in egg is when you use "void" and when to use "null". My first thought was to allow functions to return void (i.e. nothing) as an option as well as being able to return null or some other type:

  float|void root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    return; // Returns a "void" sentinel
  var a = root(-2); // Run-time failure: cannot assign from void

But this can be simulated with exceptions:

  float root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    throw Exception("negative root");
  var a = root(-2); // Run-time exception

or nulls:

  float? root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    return null;
  var a = root(-2); // a is set to null

But the real motivation for using void returns is in generators:

  int|void ...counter() {
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  for (var i : counter()) {
    print(i); // Print 1, 2, 3

When control from the "counter" co-routine falls "off the end" of the function, it implicitly returns a void sentinel which terminates the for-each loop iterator. The same mechanism could be used in "guarded assignment" conditions:

  Person|void getSpouse(Person p) {
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope

Given that one of the aims of egg is to reduce extraneous cognitive load, do we really need two concepts, "null" and "void", instead of just one? Is "computer science" making "learning programming" more complicated than necessary? This is the problem I've been struggling with.

Plan B

I don't think we can get rid of the concept of "void" for functions that do not return a value (a.k.a. methods). Fortunately [at the time of writing!], tuples are not supported in egg. Therefore, we could enforce that user-defined functions are declared to return either zero or one value. Generators terminate with a (hidden) sentinel and functions designed to work with guarded assignment conditions should return null.

This would make things look simpler in the examples above:

  int... counter() { // Removed "|void"
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  for (var i : counter()) {
    print(i); // Print 1, 2, 3


  Person? getSpouse(Person p) { // Replaced "|void" with "?"
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
    return null; // Now needed
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope

This mechanism seems to clear up another (minor) issue I was worrying about: what is the difference between

  if (var wife = einstein.getSpouse()) ...


  if (var? wife = einstein.getSpouse()) ...

With the second plan, it's obvious that the latter form is not useful and I suspect that my gut feeling that "var?" is a useless construct anywhere is true.

Side Note

I came up with the notation "type1|type2" independently, but have since discovered that (at least) the Pike programming language uses the same syntax. Pike also only allows "void" to be used for function return types.
"One thing [the language designer] should not do is to include untried ideas of his own. His task is consolidation, not innovation." 
C. A. R. Hoare, "Hints on Programming Language Design" 1973

Wednesday, 30 May 2018

egg Chronology of Computer Languages 4

I couldn't resist re-visiting this after the Bank Holiday weekend.

After a bit more research, the type of diagram I'm actually looking for for COCL is a layered graph drawing, also known as a Sugiyama diagram. As suspected, it's a difficult problem to lay out these diagrams but there are some well-known heuristics. Placing a descendant at an x-position that is an average (mean, median, etc.) of its ancestors' x-positions helps reduce edge crossings. It also seems to organise the programming languages of COCL into related groups.

So I did a bit more fettling of the box-drawing version of the data-set and then produced a colourful SVG version based on that layout data. The only slight wrinkle was getting the arcs looking good when there isn't enough room for a full ninety-degree elbow. See the purple link between "Lisp" and "Logo" below:

I didn't animate the final SVG because it's just too expensive computationally and there's no scope for caching frames like I did with the canvas-based "waterfall" version.

The opacity of each connection is a function of the time in years between the appearance of the ancestor and the descendant. I'm not sure it's strictly meaningful, but it adds variety.

Saturday, 26 May 2018

egg Chronology of Computer Languages 3

As one last task, I looked at producing a more traditional "flow chart" from the COCL data-set. Yet again, it turned out to be surprisingly difficult.

Sankey diagram is a close match for what I'm looking for: an ordered DAG of ancestors and descendants. But, in all the little tools I briefly looked at, the layouts of the final charts leave a lot to be desired.

I started work on a basic network layout engine in C++ but quickly came to a dead end: the number of permutations for the node positioning alone is in the order of 1e+96 which is just not tractable. And this doesn't even take into consideration the edge routing.

I finally plumped for two tools (one in C++ and the other in JavaScript) to semi-automate the layout and then continually fettled it by hand (for instance, lining up the C and Pascal families of languages vertically) until it was relatively compact.

The final result (bottom) displays the diagram using fixed-width, box-drawing characters:

Wednesday, 23 May 2018

egg Chronology of Computer Languages 2

Another visualisation I tried with the COCL data-set was a chord diagram (revamped from the cocktail version I did a couple of years ago). Apart for realising how bad my JavaScript was back then, I discovered that chord diagrams do not make the "direction" of relationships obvious. I've added a bit of animation to try to remedy that when you hover over a language name.
I find this visualisation much more aesthetically pleasing that the rather workman-like "waterfall". The fact that the chronology is clock-like is also rather agreeable.

I turns out that drawing "marching ant" lines is quite expensive, even in modern browsers; so I re-vamped the "waterfall" diagram from yesterday to cache frames. It means that the animation is a bit jittery for the first few seconds after landing on the web page, but quickly becomes very smooth.

Click here for the interactive versions.

Tuesday, 22 May 2018

egg Chronology of Computer Languages 1

I was thinking about "curly brace languages" in the context of egg at the weekend, and remembered a graphic I'd seen of the "family tree" of computer programming languages. I googled around and found the images I'd seen at the turn of the millennium (e.g. here and here), but nothing from the last ten years. Strange, I thought; I'll have a go at either updating or redrawing a tree.

It quickly became apparent why I couldn't find such a chart.

I started by manually scraping the data from the Wikipedia pages and producing a spreadsheet of dependencies with:

  • Language name
  • Year it first appeared
  • Ancestors (prior languages that are said to have influenced it)
  • Link to Wikipedia page
That dataset (COCL: A Chronology of Computer Languages) is stored as JavaScript here. There are over sixty data points ranging from FORTRAN (1957) to Swift (2014). The choice of points is fairly subjective, the data is direct from Wikipedia.

Then I tried to draw the recombining tree (DAG) by hand. I got this far and gave up:

Even just concentrating on a subset (the "curlies") didn't help much:

The problem is that I'm trying to get the ordering of the chronology fixed in one of the axes (the y-axis in this case: oldest at the top, newest at the bottom). I've tried curved lines and even duplicated nodes to try to minimise the number of crossing edges. I've tried putting the oldest languages in the centre of the canvas and working outwards, like a bulls-eye. But nothing seems to work. I'm sure there are technique and/or software out there that can solve this problem, but the linear ordering seems to be beyond the ones I briefly looked at.

So I decided to throw a bit of JavaScript at it. It turns out it's easy to get ugly results from the dataset; the trick is to make it pretty.

My first attempt is an animated waterfall chart:

It's not very elegant, but does give some useful insights into popular ancestors. Everyone seems to want Lisp in their language's DNA these days.

Click here for the interactive version.

Tuesday, 1 May 2018

egg Strings

Currently, strings are a fundamental type in egg. In many languages they are not. I'm still undecided how much compiler and language support to provide. In particular, because egg is supposed to be a stable language, if strings are a fundamental type, any decisions I make now will need to be "baked" into the specification. So I've been looking into what strings actually are.

In languages where strings are not fundamental, such as C++, there may be many representations of text:


This has the advantage that the different representations and implementations can be tailored for different criteria, but it can be a bit overwhelming for learners.

Strings in egg are simply sequences of Unicode characters. They are immutable (like Java) and do not directly expose their internal representation. They are also independent of your locale.

So what can you do with strings like these? The Wikipedia pages leave a bit to be desired on this subject, so I had to start from first principles.


Strings can be assigned from string literals or from other strings:

  string greeting = "hello";
  string word = greeting;


Strings can be created by concatenating other strings using the type-name constructor:

  string greeting = string("hello"); // "hello"
  greeting = string(greeting," world"); // "hello world"

Property: int length

Strings have a single property, "length", which is the count of the Unicode code points (not code units) that make up the string:

  string greeting = "hello";
  print(greeting.length); // prints '5'

Operator: string[int]

Strings have an indexing operator to extract single-character substrings at the relevant integer code point index:

  string greeting = "hello";
  string letter = greeting[0]; // sets letter to "h"

If the index is negative, or beyond the end of the string, an exception is thrown.

Operators: == and !=

Strings have equality/inequality operators:

  print("hello" == "hello"); // prints 'true'
  print("hello" == "goodbye"); // prints 'false'
  print("hello" != "hello"); // prints 'false'
  print("hello" != "goodbye"); // prints 'true'

Equality is case-sensitive and based on the equality of the constituent code point sequence. Strings are never equal to non-string values.


The code points that make up a string can be iterated:

  string greeting = "hello";
  for (string i : greeting) print(i); // prints 'h' 'e' 'l' 'l' 'o'

Note that the iteration returns single-character strings.

Member: int hash()

Computes a 64-bit signed hash integer for use in containers and algorithms.

Member: int compare(string other)

Compares the string to another in an invariant manner (code point by code point). Returns negative/zero/position depending on whether the string is less/equal.greater than the other.

  "hello".compare("goodbye") // returns a positive integer

Case-insensitive and locale-specific comparisons are outside the remit of the basic string methods.

To prevent accident assumptions about collation ordering, the relational operators "<", "<=", ">=" and ">" are not directly supported for strings.

Member: bool contains(string needle)

Searches for a substring within the string.

  "beggar".contains("egg") // returns true

This is an optimisation of

  haystack.indexOf(needle) != null

Member: bool startsWith(string needle)

Determines if a string starts with substring.

  "beggar".startsWith("beg") // returns true

This is an optimisation of

  haystack.indexOf(needle) == 0

Member: bool endsWith(string needle)

Determines if a string starts with substring.

  "beggar".endsWith("beg") // returns false

This is an optimisation of

  haystack.indexOf(needle) == (haystack.length - needle.length)

Member: int? indexOf(string needle)

Returns the zero-based index of the first occurrence of the substring, or "null" if the substring is not present.

  "beggar".indexOf("egg") // returns 1

The full signature is

  int? indexOf(string needle,
               int? start_index = null,
               int? limit = null,
               bool? negate = null)

Member: int? lastIndexOf(string needle)

Returns the zero-based index of the last occurrence of the substring or "null" if the substring is not present.

  "beggar".lastIndexOf("g") // returns 3

The full signature is

  int? lastIndexOf(string needle,
                   int? start_index = null,
                   int? limit = null,
                   bool? negate = null)

Member: string replace(string needle, string replacement)

Replaces occurrences of a substring with another string.

  "beggar".replace("egg", "e") // returns "bear"

Remember, the result is returned from this function; the original string is unaltered. The full signature is

  string replace(string needle,
                 string replacement,
                 int? max_occurrences = null)

If "max_occurrences" is negative, replacements are made for the last occurrences of the substring.

Member: string slice(int? begin = null, int? end = null)

Extracts a substring from the string. The parameters are both optional indices. Negative indices count back from the end of the string:

  "beggar".slice(1) // returns "eggar"
  "beggar".slice(1, 4) // returns "egg"
  "beggar".slice(1, -2) // returns "egg"

Member: string[] split(string separator, int? limit = null)

Splits a string into an array of substrings:

  "beggar".split("g") // returns ["be","","ar"]

If "limit" is negative, splits are limited to the last occurrences of the separator.

Member: string join(any[] ...params)

Joins strings together:

  "a".join("b", "n", "n", "") // returns "banana"

Member: string repeat(int count)

Creates a new string by concatenating repetitions of a source string:

  "beggar".repeat(0) // returns ""

  "beggar".repeat(3) // returns "beggarbeggarbeggar"

And that's pretty much it. There are some obvious (and deliberate) omissions:

MISSING Member: string format(any[] ...params)

The functionality for formatting general text will live in separate modules. This will allow for more than one formatting scheme and also allow the algorithms to evolve over time.

MISSING Member: int codePointAt(int index)

This is really a function of the (Unicode) encoding specification, so should be defined there:

  import unicode = /*...*/;
  print(unicode.getCodePointAt("hello", 1)); // prints 101

MISSING Members: string lowercase() and string uppercase()

These are omitted because they rely on the Unicode character database, which changes over time. See here. The functionality will probably be made available via a versioned module:

  import unicode = /*...*/;
  print(unicode.toUpper("hello")); // prints "HELLO"

MISSING Members: string trim(), string trimLeft() and string trimRight()

Removing white-space (or any class of character) from the ends of strings is a common operation. The Unicode standards committee designated twenty-odd code points as "white-space". This includes one of my favourite characters: U+1680: OGHAM SPACE MARK. Alas, the categorisation is somewhat fluid, so one requires access to the Unicode character database to get it right.

MISSING Member: string reverse()

Although characters in egg strings are Unicode, reversing a Unicode string may not produce the results you expect. To do it properly again requires access to the Unicode character database, so the reverse function should belong to a versioned module.

Also, there some members that I'm still dithering about:

POSSIBLE Members: string padLeft() and string padRight()

These members can be synthesised using the existing members. However, they are surprisingly easy to get wrong, so maybe they're a useful addition:

  string padLeft(int min_length, string? padding = null)
  string padRight(int min_length, string? padding = null)