Tuesday, 19 June 2018

egg Pointers and Pointees

This week, I've been struggling with pointers in the egg language specification. First off, a bit of nomenclature: I'm going to use the term "pointer" to mean a reference to data in another location in general; I find the overloaded term "reference" too confusing.

There are any number of pitfalls with pointers, particularly C/C++-style raw pointers. As egg is meant to make programming easier for learners, it's useful to look at some of these issues.

Pointer Arithmetic

C/C++ allows arithmetic on raw pointers:

  // C/C++ code
  char buffer[10];
  char* p = buffer;
  p += 20;
  *p = '!' // Undefined behaviour

This is prone to error, so egg does not allow arbitrary pointer arithmetic. The only thing you can do with them is:
  1. Create pointers by taking the address of a pointee using "&";
  2. Assign one pointer to another;
  3. Pass pointers to functions; and
  4. De-reference a pointer to get hold of the pointee's value
In particular, you cannot increment "++" or decrement "--" a pointer.

Null Pointers

As mentioned in my last blog post, null pointers have been called a "billion-dollar mistake". Consequently, egg pointers cannot contain null.

  // egg code
  int* p = null; // Not allowed

However, there's nothing wrong with declaring the variable as taking a pointer or a null value:

  // egg code
  int*? p = null; // Allowed

Note the subtle difference between "int*?" and "int?*": the former can hold the value null or a pointer to an integer, whilst the latter can hold a (non-null) pointer to a value that is either null or an integer.

Wild Pointers

Wild pointers are raw pointers whose value has not been initialised:

  // C/C++ code
  char* p;
  *p = '!' // Undefined behaviour

The egg language can detect uninitialised pointers, so this shouldn't be a problem.

Dangling Pointers

Dangling pointers are raw pointers that used to point to valid pointees, but no longer do:

  // C++ code
  char* p = new char;
  delete p;
  *p = '!' // Undefined behaviour

The egg language uses garbage collection, so the "delete" (and/or "free") mechanism is redundant (and explicitly disallowed). However, there are other causes of dangling pointers:

  // C/C++ code
  char* p;
  {
    char buffer;
    p = &buffer;
  }
  *p = '!' // Undefined behaviour

The problem here is that "buffer" is allocated on the stack, but that stack memory is "reclaimed" at the end of the block.

The original Pascal solved this problem but not having an "address of" operator. That is, you could not take the address of a variable; you could only obtain a new pointer by allocating data on the heap using "new()". Later flavours of Pascal added the "@" prefix operator for taking the address of arbitrary variables.

In general, stack-allocated values have the potential to cause dangling pointers. See below.

Stack versus Heap

As this Rust document illustrates, the concepts of "stack" and "heap" are important for systems programming and computer science, but they can be confusing for learners. Some languages (like old-style C/C++) force you to be explicit but give very little protection against problems like dangling pointers. Newer languages (like Go and Rust) provide better security, but at the cost of a much steeper (and longer) learning curve.

One solution to the problem is to only allow either stack or heap allocations. Early FORTH implementations only allowed stack allocations: users found this severely limiting. Many scripting languages only allow heap allocations, but for general purpose languages, there are efficiency concerns with this strategy. One compromise, used by many languages, is to introduce the concept of boxing.

Boxing and Unboxing

Boxing is the act of  wrapping fundamental types so that they become "reference" types and so are allocated on the heap. In Rust 1.0, this is explicit:

  // Rust 1.0 code
  let p = Box::new(5); // p points to a heap-allocated integer 5

However, some languages "autobox" fundamental types, which can lead to surprising behaviour. In languages where complex types (sometimes known as "reference" types) are always allocated on the heap, boxing/unboxing is just an annoying edge case.

By-Reference Function Parameters

Most imperative languages have the ability to specify whether individual function parameters are passed by value or by reference. Pointers are one mechanism to achieve this:

  // C code
  void by_value(int x);
  void by_reference(int* x);

  int v = 10;
  by_value(v); // v is not changed
  by_reference(&v); // v may be changed

If using raw pointers, care must be taken concerning null pointers.

In many languages, complex types are passed by what Barbara Liskov et al named call by sharing:

  // JavaScript code
  function by_sharing(x) { ... }

  var v = { a: 1, b: 2 };
  by_sharing(v); // object pointed to by v may change

Therefore, true call-by-reference is often only required for fundamental types.

Pointers in egg

Conceptually, variables (including function parameters) in egg are stored as triples. For example:

  // egg code
  int? v = 10;

will be stored as:
  • Variable name: "v"
  • Variable type: "int|null"
  • Current value: 10
A variable's current value also has a type (in this case "int"); this is known as the run-time type of "v".

Variable triples are generally stored on a stack (one exception is a variable inside a co-routine, but more on that much later) and their values are variants (tagged unions). As variables are introduced, they are added to the top of the stack; as they go out of scope, they are dropped. The variable value variants can store a fairly large number of things (each fits into 64 bits, excluding the tag):
  1. A null value,
  2. Boolean values,
  3. Integer values,
  4. Floating-point values,
  5. References to strings (allocated on the heap),
  6. References to objects (allocated on the heap),
  7. Pointers,
  8. Indirects,
  9. etc.
In these variants, a pointer (Item 7) is a soft (i.e. garbage-collected) pointer to another variant allocated on the heap. An arbitrary level of indirection can be achieved by just chaining together pointers to other pointers.

Consider the following code:

  // egg code
  int v = 10;
  int* p = &v;

The first declaration pushes the following triple to the top of the variable stack:
  • Name="v", Type="int", Value=10
The second declaration has a problem. It tries to push a new triple:
  • Name="p", Type="int*", Value=???
  • Name="v", Type="int", Value=10
But what is the pointer value of "p"? Unfortunately, the value of "v" is not on the heap, as required; if we take its address, we're leaving ourselves open to the dangling pointer problem. Our solution is to patch the value of "v" to be an "Indirect" value (Item 8 above):
  • Name="p", Type="int*", Value=[heap314]
  • Name="v", Type="int", Value=Indirect → 10 at address [heap314]
We do this by copying the previous value of "v" to a new variant (at address [heap314] say) and then modifying the triple for "v" to point to this indirection. Now, the lifetime of the value of "v" is independent of the lifetime of the variable "v". This mitigates the dangling pointer problem, but we must be careful to silently follow "Indirect" references appropriately.

The current implementation patches values in this way if and when the first "address of " operator "&" is invoked on a variable. This means that heap allocations are not required for the vast majority of scalar values created during the execution of an egg program. In practice, it may be safer (in a thread-rich environment) to perform static analysis of the egg code to determine if a variable can ever have its address taken and, if so, create it with the indirection built in. This would eliminate potential race conditions in patching and accessing a variable across threads.

Limitations

This strategy works well with simple variables. There are at least two limitations that I can currently see.

First, this scheme does not extend to co-routine variables. The rules for the lifetimes of co-routine variables are already complicated: adding the ability to extend their lives by taking their address may be a step too far. I suspect storing all co-routine variables on the heap may be a good compromise.

Second, you cannot currently take the address of expressions other than variables.

  // Case 1: constants
  int* p1 = &10;

  // Case 2: arrays
  var a = [ 1, 2, 3 ];
  int* p2 = &(a[1]);

  // Case 3: objects
  var x = { y: 1 };
  int* p3 = &(x.y);

In Case 1, you might reasonably expect "p1" to be pointing at an integer 10 allocated on the heap. You can currently get around this by using a temporary variable, but it may be an edge case I'll revisit.

In Case 2, taking the address of an array element is not permitted. One argument against this ability is to consider what to do if the array is resized to fewer elements. This has re-introduced a dangling pointer.

Case 3 can be argued against in a similar fashion.

Tuesday, 12 June 2018

egg Null and Void

When I first came across the "void" keyword in the C programming language in the late eighties, I was very confused. What was the point of a type that couldn't hold any values? Judging by the questions on stackoverflow, it's still a problem for many learners of many languages. A related issue is the confusion over "null" and "undefined" in JavaScript. This particular thorn has been perpetuated in supposedly "cleaner" languages like TypeScript.

I've been struggling with this issue in the design of egg. But first some background...

Background

The introduction of the concept of "null" into higher-level computer languages by Tony Hoare has been called a "billion-dollar mistake". The idea is that values of the Null type (e.g. "nullptr_t" in C++) can only ever hold a single null value ("nullptr" in C++). In some languages, such at Java, object types (including "String") are implicitly sub-classes of the Null type. That is, you can always store a null value in a Java string variable. In C++, you can always assign "nullptr" to a pointer (e.g. "int*"), though never to references (e.g. "int&").

In contrast, if we ignore "compromised" type systems such as TypeScript, the "void" type can never hold any value. Thus, in C/C++, de-referencing of void pointers is not allowed.

This all fits nicely into an ontology of types:

  • Entities of type "void" can hold no values.
  • Entities of type "null" can hold only one value: "null"
  • Entities of type "bool" can hold only two values: "false" and "true"
  • etc, etc.

Plan A

The problem I'm struggling with in egg is when you use "void" and when to use "null". My first thought was to allow functions to return void (i.e. nothing) as an option as well as being able to return null or some other type:

  float|void root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    return; // Returns a "void" sentinel
  }
  var a = root(-2); // Run-time failure: cannot assign from void

But this can be simulated with exceptions:

  float root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    throw Exception("negative root");
  }
  var a = root(-2); // Run-time exception

or nulls:

  float? root(float x) {
    if (x >= 0) {
      return Math.sqrt(x);
    }
    return null;
  }
  var a = root(-2); // a is set to null

But the real motivation for using void returns is in generators:

  int|void ...counter() {
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  }
  for (var i : counter()) {
    print(i); // Print 1, 2, 3
  }

When control from the "counter" co-routine falls "off the end" of the function, it implicitly returns a void sentinel which terminates the for-each loop iterator. The same mechanism could be used in "guarded assignment" conditions:

  Person|void getSpouse(Person p) {
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
    }
  }
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope
  }

Given that one of the aims of egg is to reduce extraneous cognitive load, do we really need two concepts, "null" and "void", instead of just one? Is "computer science" making "learning programming" more complicated than necessary? This is the problem I've been struggling with.

Plan B

I don't think we can get rid of the concept of "void" for functions that do not return a value (a.k.a. methods). Fortunately [at the time of writing!], tuples are not supported in egg. Therefore, we could enforce that user-defined functions are declared to return either zero or one value. Generators terminate with a (hidden) sentinel and functions designed to work with guarded assignment conditions should return null.

This would make things look simpler in the examples above:

  int... counter() { // Removed "|void"
    yield 1;
    yield 2;
    yield 3;
    // implicit "return;"
  }
  for (var i : counter()) {
    print(i); // Print 1, 2, 3
  }

and

  Person? getSpouse(Person p) { // Replaced "|void" with "?"
    Person spouse;
    if (object.tryGetProperty(p, "spouse", &spouse)) {
      return spouse;
    }
    return null; // Now needed
  }
  if (var wife = einstein.getSpouse()) { // Guarded assignment
    print("Einstein's wife is ", wife.name);
  } else {
    print("Einstein has no wife"); // 'wife' is out of scope
  }

This mechanism seems to clear up another (minor) issue I was worrying about: what is the difference between

  if (var wife = einstein.getSpouse()) ...

and

  if (var? wife = einstein.getSpouse()) ...

With the second plan, it's obvious that the latter form is not useful and I suspect that my gut feeling that "var?" is a useless construct anywhere is true.

Side Note

I came up with the notation "type1|type2" independently, but have since discovered that (at least) the Pike programming language uses the same syntax. Pike also only allows "void" to be used for function return types.
"One thing [the language designer] should not do is to include untried ideas of his own. His task is consolidation, not innovation." 
C. A. R. Hoare, "Hints on Programming Language Design" 1973

Wednesday, 30 May 2018

egg Chronology of Computer Languages 4

I couldn't resist re-visiting this after the Bank Holiday weekend.

After a bit more research, the type of diagram I'm actually looking for for COCL is a layered graph drawing, also known as a Sugiyama diagram. As suspected, it's a difficult problem to lay out these diagrams but there are some well-known heuristics. Placing a descendant at an x-position that is an average (mean, median, etc.) of its ancestors' x-positions helps reduce edge crossings. It also seems to organise the programming languages of COCL into related groups.

So I did a bit more fettling of the box-drawing version of the data-set and then produced a colourful SVG version based on that layout data. The only slight wrinkle was getting the arcs looking good when there isn't enough room for a full ninety-degree elbow. See the purple link between "Lisp" and "Logo" below:

I didn't animate the final SVG because it's just too expensive computationally and there's no scope for caching frames like I did with the canvas-based "waterfall" version.


The opacity of each connection is a function of the time in years between the appearance of the ancestor and the descendant. I'm not sure it's strictly meaningful, but it adds variety.

Saturday, 26 May 2018

egg Chronology of Computer Languages 3

As one last task, I looked at producing a more traditional "flow chart" from the COCL data-set. Yet again, it turned out to be surprisingly difficult.

Sankey diagram is a close match for what I'm looking for: an ordered DAG of ancestors and descendants. But, in all the little tools I briefly looked at, the layouts of the final charts leave a lot to be desired.

I started work on a basic network layout engine in C++ but quickly came to a dead end: the number of permutations for the node positioning alone is in the order of 1e+96 which is just not tractable. And this doesn't even take into consideration the edge routing.

I finally plumped for two tools (one in C++ and the other in JavaScript) to semi-automate the layout and then continually fettled it by hand (for instance, lining up the C and Pascal families of languages vertically) until it was relatively compact.

The final result (bottom) displays the diagram using fixed-width, box-drawing characters:



Wednesday, 23 May 2018

egg Chronology of Computer Languages 2

Another visualisation I tried with the COCL data-set was a chord diagram (revamped from the cocktail version I did a couple of years ago). Apart for realising how bad my JavaScript was back then, I discovered that chord diagrams do not make the "direction" of relationships obvious. I've added a bit of animation to try to remedy that when you hover over a language name.
I find this visualisation much more aesthetically pleasing that the rather workman-like "waterfall". The fact that the chronology is clock-like is also rather agreeable.

I turns out that drawing "marching ant" lines is quite expensive, even in modern browsers; so I re-vamped the "waterfall" diagram from yesterday to cache frames. It means that the animation is a bit jittery for the first few seconds after landing on the web page, but quickly becomes very smooth.

Click here for the interactive versions.

Tuesday, 22 May 2018

egg Chronology of Computer Languages 1

I was thinking about "curly brace languages" in the context of egg at the weekend, and remembered a graphic I'd seen of the "family tree" of computer programming languages. I googled around and found the images I'd seen at the turn of the millennium (e.g. here and here), but nothing from the last ten years. Strange, I thought; I'll have a go at either updating or redrawing a tree.

It quickly became apparent why I couldn't find such a chart.

I started by manually scraping the data from the Wikipedia pages and producing a spreadsheet of dependencies with:

  • Language name
  • Year it first appeared
  • Ancestors (prior languages that are said to have influenced it)
  • Link to Wikipedia page
That dataset (COCL: A Chronology of Computer Languages) is stored as JavaScript here. There are over sixty data points ranging from FORTRAN (1957) to Swift (2014). The choice of points is fairly subjective, the data is direct from Wikipedia.

Then I tried to draw the recombining tree (DAG) by hand. I got this far and gave up:


Even just concentrating on a subset (the "curlies") didn't help much:


The problem is that I'm trying to get the ordering of the chronology fixed in one of the axes (the y-axis in this case: oldest at the top, newest at the bottom). I've tried curved lines and even duplicated nodes to try to minimise the number of crossing edges. I've tried putting the oldest languages in the centre of the canvas and working outwards, like a bulls-eye. But nothing seems to work. I'm sure there are technique and/or software out there that can solve this problem, but the linear ordering seems to be beyond the ones I briefly looked at.

So I decided to throw a bit of JavaScript at it. It turns out it's easy to get ugly results from the dataset; the trick is to make it pretty.

My first attempt is an animated waterfall chart:


It's not very elegant, but does give some useful insights into popular ancestors. Everyone seems to want Lisp in their language's DNA these days.

Click here for the interactive version.

Tuesday, 1 May 2018

egg Strings

Currently, strings are a fundamental type in egg. In many languages they are not. I'm still undecided how much compiler and language support to provide. In particular, because egg is supposed to be a stable language, if strings are a fundamental type, any decisions I make now will need to be "baked" into the specification. So I've been looking into what strings actually are.

In languages where strings are not fundamental, such as C++, there may be many representations of text:

  char*
  char[N]
  std::string
  std::wstring
  std::u16string
  std::u32string

This has the advantage that the different representations and implementations can be tailored for different criteria, but it can be a bit overwhelming for learners.

Strings in egg are simply sequences of Unicode characters. They are immutable (like Java) and do not directly expose their internal representation. They are also independent of your locale.

So what can you do with strings like these? The Wikipedia pages leave a bit to be desired on this subject, so I had to start from first principles.

Assignment

Strings can be assigned from string literals or from other strings:

  string greeting = "hello";
  string word = greeting;

Concatenation

Strings can be created by concatenating other strings using the type-name constructor:

  string greeting = string("hello"); // "hello"
  greeting = string(greeting," world"); // "hello world"

Property: int length

Strings have a single property, "length", which is the count of the Unicode code points (not code units) that make up the string:

  string greeting = "hello";
  print(greeting.length); // prints '5'

Operator: string[int]

Strings have an indexing operator to extract single-character substrings at the relevant integer code point index:

  string greeting = "hello";
  string letter = greeting[0]; // sets letter to "h"

If the index is negative, or beyond the end of the string, an exception is thrown.

Operators: == and !=

Strings have equality/inequality operators:

  print("hello" == "hello"); // prints 'true'
  print("hello" == "goodbye"); // prints 'false'
  print("hello" != "hello"); // prints 'false'
  print("hello" != "goodbye"); // prints 'true'

Equality is case-sensitive and based on the equality of the constituent code point sequence. Strings are never equal to non-string values.

Iteration

The code points that make up a string can be iterated:

  string greeting = "hello";
  for (string i : greeting) print(i); // prints 'h' 'e' 'l' 'l' 'o'


Note that the iteration returns single-character strings.

Member: int hash()

Computes a 64-bit signed hash integer for use in containers and algorithms.

Member: int compare(string other)

Compares the string to another in an invariant manner (code point by code point). Returns negative/zero/position depending on whether the string is less/equal.greater than the other.

  "hello".compare("goodbye") // returns a positive integer

Case-insensitive and locale-specific comparisons are outside the remit of the basic string methods.

To prevent accident assumptions about collation ordering, the relational operators "<", "<=", ">=" and ">" are not directly supported for strings.

Member: bool contains(string needle)

Searches for a substring within the string.

  "beggar".contains("egg") // returns true

This is an optimisation of

  haystack.indexOf(needle) != null

Member: bool startsWith(string needle)

Determines if a string starts with substring.

  "beggar".startsWith("beg") // returns true

This is an optimisation of

  haystack.indexOf(needle) == 0

Member: bool endsWith(string needle)

Determines if a string starts with substring.

  "beggar".endsWith("beg") // returns false

This is an optimisation of

  haystack.indexOf(needle) == (haystack.length - needle.length)

Member: int? indexOf(string needle)

Returns the zero-based index of the first occurrence of the substring, or "null" if the substring is not present.

  "beggar".indexOf("egg") // returns 1

The full signature is

  int? indexOf(string needle,
               int? start_index = null,
               int? limit = null,
               bool? negate = null)

Member: int? lastIndexOf(string needle)

Returns the zero-based index of the last occurrence of the substring or "null" if the substring is not present.

  "beggar".lastIndexOf("g") // returns 3

The full signature is

  int? lastIndexOf(string needle,
                   int? start_index = null,
                   int? limit = null,
                   bool? negate = null)

Member: string replace(string needle, string replacement)

Replaces occurrences of a substring with another string.

  "beggar".replace("egg", "e") // returns "bear"

Remember, the result is returned from this function; the original string is unaltered. The full signature is

  string replace(string needle,
                 string replacement,
                 int? max_occurrences = null)

If "max_occurrences" is negative, replacements are made for the last occurrences of the substring.

Member: string slice(int? begin = null, int? end = null)

Extracts a substring from the string. The parameters are both optional indices. Negative indices count back from the end of the string:

  "beggar".slice(1) // returns "eggar"
  "beggar".slice(1, 4) // returns "egg"
  "beggar".slice(1, -2) // returns "egg"

Member: string[] split(string separator, int? limit = null)

Splits a string into an array of substrings:

  "beggar".split("g") // returns ["be","","ar"]

If "limit" is negative, splits are limited to the last occurrences of the separator.

Member: string join(any[] ...params)

Joins strings together:

  "a".join("b", "n", "n", "") // returns "banana"

Member: string repeat(int count)

Creates a new string by concatenating repetitions of a source string:

  "beggar".repeat(0) // returns ""


  "beggar".repeat(3) // returns "beggarbeggarbeggar"

And that's pretty much it. There are some obvious (and deliberate) omissions:

MISSING Member: string format(any[] ...params)

The functionality for formatting general text will live in separate modules. This will allow for more than one formatting scheme and also allow the algorithms to evolve over time.

MISSING Member: int codePointAt(int index)

This is really a function of the (Unicode) encoding specification, so should be defined there:

  import unicode = /*...*/;
  print(unicode.getCodePointAt("hello", 1)); // prints 101

MISSING Members: string lowercase() and string uppercase()

These are omitted because they rely on the Unicode character database, which changes over time. See here. The functionality will probably be made available via a versioned module:

  import unicode = /*...*/;
  print(unicode.toUpper("hello")); // prints "HELLO"

MISSING Members: string trim(), string trimLeft() and string trimRight()

Removing white-space (or any class of character) from the ends of strings is a common operation. The Unicode standards committee designated twenty-odd code points as "white-space". This includes one of my favourite characters: U+1680: OGHAM SPACE MARK. Alas, the categorisation is somewhat fluid, so one requires access to the Unicode character database to get it right.

MISSING Member: string reverse()

Although characters in egg strings are Unicode, reversing a Unicode string may not produce the results you expect. To do it properly again requires access to the Unicode character database, so the reverse function should belong to a versioned module.

Also, there some members that I'm still dithering about:

POSSIBLE Members: string padLeft() and string padRight()

These members can be synthesised using the existing members. However, they are surprisingly easy to get wrong, so maybe they're a useful addition:

  string padLeft(int min_length, string? padding = null)
  string padRight(int min_length, string? padding = null)

Computers Don't Lie T-Shirt

Computer Don't Lie - They're Just Misinformed

Tuesday, 17 April 2018

egg Functions

My first test script for functions in egg was everyone's favourite chestnut: the Fibonacci series.

  int fibonacci(int n) {
    if (n < 2) {
      return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
  }
  print(fibonacci(10));

It does indeed print out "55"!

In egg, functions defined like this are actually special cases of callable objects. The script declares an identifier "fibonacci" which is initialised with an instance of an object that supports the "call" operation: in this case, taking a single integer parameter and returning an integer.

At present, there is no notion of read-only variables in egg, so it's possible to subsequently assign a different function to "fibonacci":

  int fibonacci(int n) {
    if (n < 2) {
      return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
  }
  int zero(int n) {
    return 0;
  }
  fibonacci = zero;
  print(fibonacci(10));

This prints out "0". It's a good demonstration that functions in egg are considered first-class entities, but it might violate the principle of "least surprise" for newcomers.

egg "For" Statements

Recently, I got my first egg script running. Here it is:

  var s = 0;
  for (var i = 0; i < 100; ++i) {
    s += i;
  }
  print(s);

You'll be unsurprised to hear that this printed out "4950" (it was a pleasant surprise to me when it first happened, though, I can tell you!)

There are three things worth mentioning in this script.

Firstly, the "var" keyword initiates type inference for variables "s" and "i". In both cases, "int" is inferred (there are no unsigned integers in egg).

Secondly, the "print" function is a built-in method that will probably be removed, but is useful for testing, at the moment.

Finally, the syntax of "for" statements turns out to be non-trivial. Like C++, I adopted two forms:

  for (before ; condition ; step) { ... }
  for (iterator : collection) { ... }

The latter form is for iterating around collections, which we won't discuss here.

My main concern with the former form of the "for" loop is:
Which syntactic elements are valid for "before", "condition" and "step"?
The easiest of the three clauses is "condition". It's optional, but I only allow expressions that evaluate to Boolean values. This is similar to all the main curly-brace languages (C/C++/Java/JavaScript).

The "before" statement is also optional, but it must be a statement. This includes variable declarations: the scope of such variables is limited to the "for" statement, including the "condition" and "step" clauses. Again this is similar to the other languages (if we ignore JavaScript's problematic scoping rules).

The "step" statement is optional too, but cannot be a declaration. As mentioned previously, the increment and decrement operators are supported in egg purely to allow the classic for-loop idiom seen in this example.

But then I got a bit confused with which egg statements should be valid for "before" and "step". Can you use "break" and "continue"? If so, what do they do?

So I asked my C++ compiler, and it says that I cannot use "break" or "continue", but I can "throw" exceptions. The reason you cannot "break" or "continue" in "before" and "step" clauses is because those statements are just that: statements. The "before" and "step" clauses in C++ expect C++ expressions.

But why can you "throw" exceptions in those clauses in C++?

  for (auto i = 0; i < 100; throw "Bang!") {} // Valid C++

Well, it turns out that what I think of as throw statements are actually throw expressions (of type "void") in the formal syntax. It's a cul-de-sac that others have found themselves in too!

As egg is meant to be an easy-to-learn language, with few surprises, I decided to classify "throw" as a statement and explicitly forbid it in expressions and "before" and "step" clauses of "for" statements.

Tuesday, 3 April 2018

ZX Spectrum Flags of the European Union

Someone, somewhere, out there on the Internet, is looking for the flags of the twenty-eight (current) member countries of the European Union  ... in the original Sinclair ZX Spectrum SCREEN$ format. Surely...

Sunday, 1 April 2018

Mappa Edmundi de Waal

Personally, I think Edmund missed a trick...


(original by Edmund de Waal)

Friday, 23 March 2018

A Prime Example of JavaScript Golf

This is why you shouldn't play JavaScript golf:

for(a=[1];!a[999];)/^(11+)\1+$/.test(a+=1)||print(a.length)

Thursday, 8 March 2018

Vexatious Parses in C++

As part of my work on the egg computer language specification, I've been looking into parsing curly-brace-type languages. There are a number of cul de sacs in these language specifications. Here's one from C++ I've been struggling with today:
    int a = 1;
    int b = 2;
    int c = a-b;
What's the value of "c"? Obviously, it's minus one. But what about this:
    c = a--b;
My Microsoft compiler tells me that this is a malformed expression:
    syntax error: missing ';' before identifier 'b'
But the following is fine:
    c = a---b;
This sets "c" to minus one and decrements "a". Honest.

Here's a list of parses:
    a-b      // Parsed as "a - b"
    a--b     // Fails to compile: missing ';' before identifier 'b'
    a---b    // Parsed as "a-- - b"
    a----b   // Fails to compile: '--' needs l-value
    a-----b  // Fails to compile: '--' needs l-value

    a- -b    // Parsed as "a - -b"
    a- --b   // Parsed as "a - --b"
    a-- -b   // Parsed as "a-- -b"
    a- - -b  // Parsed as "a - - -b"
The compiler is obviously "greedy" when parsing operators; so, in the absence of white-space, it's easy for it to overlook an alternative interpretation:
    a--b     // COULD be parsed as "a - -b"
    a----b   // COULD be parsed as "a-- - -b"
    a-----b  // COULD be parsed as "a-- - --b"
I expect the compiler-writers have their hands tied by the formal language specification. But, for a new language like egg, I don't have any such restrictions.

I decided that prefix and postfix increments/decrements as expressions are bad things. This is mainly due to problems associated with side-effects and evaluation ordering. Consider:
    int a = p[++i] + p[i++]; // Not allowed
However, I think I will retain the prefix increment/decrement statements:
    ++i; // Allowed
    --i; // Allowed
    i++; // Not allowed
    i--; // Not allowed
This permits the idiomatic counter-based loop:
    for (i = 0; i < count; ++i) {
        ...
    }
The reasons for only allowing the prefix versions are two-fold:
  1. It make the language specification much less ambiguous; and
  2. People still harp on about prefix increments/decrements being slightly faster than their postfix variants, which is why they are "preferred" for looping.
Whilst I was at it, I also decided I can probably do without the unary '+' operator. That gets rid of the truly vexatious:
    c = a+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+b;

Tuesday, 27 February 2018

What is Egg?

“Egg” is an idea I’ve been thinking about for a long time. Here’s the background…

At work, over the last few months, I’ve used many computer languages:

  1. C++
  2. C#
  3. Java
  4. JavaScript
  5. Clojure
  6. Python
  7. PowerShell
  8. Windows batch commands
  9. Bash

and almost as many build/configuration file formats:

  1. Makefile
  2. JSON
  3. XML
  4. YAML
  5. INI

I appreciate that domain specific languages have their place, but often runtime performance is not an issue, so using a general-purpose language would be more than adequate. Constantly having to context-switch between difference languages and paradigms is exhausting; not to mention the numerous bugs caused by forgetting the specifics of each set of syntaxes, escape sequences, library routine quirks and so on.

What if there was a simple language that was powerful enough to get the job done without having to remember too many subtleties of the language?

Another issue I have with many languages is the lack of simple interoperability. If I want to call a C++ routine from Clojure, I’m going to have to jump through hoops.

Similarly, if you develop a prototype in one language, you often have to “productionize” it by converting it to another. This is a great source of bugs.

What if there was a language that you could transpile into other languages?

Even if the transpiled code was purely used to get a unit test framework up and running before refactoring, this would greatly mitigate the introduction of bugs.

Some of these interoperability issues are due to the frameworks or virtual machines that some of the languages require:

  • Java Virtual Machine
  • .NET Framework
  • and so on

In this regard, I think that JavaScript is quite successful because of its ubiquity: press F12 inside your browser and you have quite a powerful development environment. Running scripts outside of a browser simply requires you to download a zero-install executable such as Node.js.

What if there was a language that ran almost identically on many frameworks and/or virtual machines?

Anecdotally, it seems that Python is gaining ground as a teaching language. I’m not going to knock Python, but it seems strange that there appear to be few other candidates for teaching good software engineering practices.

What if there was a language that could be used for teaching the fundamentals of programming whilst still being useful outside of academic institutions?

Talking of Python, why do computer languages develop to the point where the designers make breaking changes (e.g. Python 2 versus Python 3)? Even venerable C++ is getting a new set of features every three years that’s difficult to keep up with.

What if there was a language that had a relatively stable syntax?

But “egg” isn’t just a computer language specification, it’s:

  • An engine to run scripts written in egg
  • A compiler to generate native code from egg source
  • A set of transpilers to generate other computer languages from egg source
  • A build system (written in egg, of course)
  • A set of core packages to perform common tasks
  • A testing framework
  • A package manager

So, that’s what “egg” is: a personal project to give me an excuse to investigate these issues.

Monday, 26 February 2018

Egg Day

The last Monday of February is "egg" day...

Sunday, 7 January 2018

Anachronicons 4

This is a tricky one; is the following an anachronicon?
Microphone icon
There's obviously a "retro" vibe going on here (it reminds me of microphones from 1930s/1940s radio sound stages) but I have a suspicion that even modern high-end microphones have similar cradle-like mounts. Nonetheless, I'm guessing very few people have seen a microphone looking like this in real life that wasn't aimed at the "retro" market.