Thursday, 31 January 2019

egg Syntax 3

As promised, I've created a guide to the egg programming language based around its syntax.

Even though it doesn't go into the nitty-gritty of aspects such as type schema, built-in functions or attributes, it's still nearly thirty pages long and took considerably longer than expected to put together. Most of the space is take up with pretty railroad diagrams and example code. But some of the delay was due to the realisation that my thinking about the type system was somewhat muddled. I've tinkered with the EBNF quite a bit in the last few weeks and now feel a lot more comfortable with it.

Short email exchanges with Profs Barbara Liskov and Niklaus Wirth convinced me that the type system should only be prevalent at the interfaces between modules. Within modules it shouldn't get in the way. I've always thought that you spend far too long in languages like C++ fussing over the concrete types (classes) instead of the functionality (functions).

Some computer languages alleviate these pressures by having sophisticated type inference, but that can cause headaches for learners (and non-learners!) who can't grok the inference rules, which are usually fiendish. For example, consider:

  int c = 10;
  var fn = (a, b) => a + b + c;

To a human reader, "fn" is obviously a function that adds its two parameters to "c". But what are the types of its parameters and its return value? Integers? Floats? Strings? A mixture? The compiler could deduce additional information by looking at the later usages of "fn" but this would make the inference of its type "non-local" and therefore potentially confusing.

The egg language has "function expressions" which are like strongly-typed "lambda expressions" (e.g. C++ lambdas):

  int c = 10;
  var fn = float(int a, float b) { return a + b + c };

These are unambiguous, but are a little clumsy. So egg accepts lambda expressions (which it converts to function expressions) providing that the types can be trivially inferred. For example:

  type Adder = float(int, float);
  int c = 10;
  Adder fn = (a, b) => a + b + c;

This isn't much of a "win" in the example above, but if the type is inferred by matching function parameters, it leads to more fluent syntax:

  float process(float(int, float) fn) {
    ...
  }
  int c = 10;
  var result = process((a, b) => a + b + c);

When inferring function types this way, "trivial" means that only the arity (i.e. the number of parameters) of the function is considered. This is further simplified by the restriction that. in egg, lambdas cannot be variadic, only true functions can. For example:

  type Fn0 = void();
  type Fn1 = void(string);
  type Fn3 = void(int,string,int);
  void process(Fn0|Fn1|Fn3 fn, string s, int a, int b) {
    if (Fn0 fn0 = fn) {
      fn0();
    } else if (Fn1 fn1 = fn) {
      fn1(s);
    } else if (Fn3 fn3 = fn) {
      fn3(a, s, b);
    } else {
      throw "internal error";
    }
  }
  process((a, s, b) => { print(a, s, b, "\n"); }); // prints "1+2"

This is quite a sophisticated dispatch mechanism for a language that doesn't permit function overloading. It allows library implementers to support algorithmic optimisation based on lambda signatures without increasing the cognitive load for people not interested in this level of detail.

Oh, I've also updated the A3 poster.

Friday, 11 January 2019

A Crowded Chronology of Cities

For a change of pace, I've been working on a web page to illustrate the most populous urban areas throughout history: A Crowded Chronology of Cities



It's interesting that the graph of the population of the largest city isn't monotonic. As far as I can tell, there were two main "reduction" events:
  • The fall of the Roman Empire
  • The fourteenth century plagues (that's my guess, anyway!)

Sunday, 4 November 2018

RGB/HSV in HLSL 8

Andrew (K-Be) emailed me this week with a problem he found in my HLSL code for HCL-to-RGB conversion. Here's the original code:

  float HCLgamma = 3;
  float HCLy0 = 100;
  float HCLmaxL = 0.530454533953517; // == exp(HCLgamma / HCLy0) - 0.5
  float PI = 3.1415926536;
 
  float3 HCLtoRGB(in float3 HCL)
  {
    float3 RGB = 0;
    if (HCL.z != 0)
    {
      float H = HCL.x;
      float C = HCL.y;
      float L = HCL.z * HCLmaxL;
      float Q = exp((1 - C / (2 * L)) * (HCLgamma / HCLy0));
      float U = (2 * L - C) / (2 * Q - 1);
      float V = C / Q;
      // PROBLEM HERE...
      float T = tan((H + min(frac(2 * H) / 4, frac(-2 * H) / 8)) * PI * 2);
      H *= 6;
      if (H <= 1)
      {
        RGB.r = 1;
        RGB.g = T / (1 + T);
      }
      else if (H <= 2)
      {
        RGB.r = (1 + T) / T;
        RGB.g = 1;
      }
      else if (H <= 3)
      {
        RGB.g = 1;
        RGB.b = 1 + T;
      }
      else if (H <= 4)
      {
        RGB.g = 1 / (1 + T);
        RGB.b = 1;
      }
      else if (H <= 5)
      {
        RGB.r = -1 / T;
        RGB.b = 1;
      }
      else
      {
        RGB.r = 1;
        RGB.b = -T;
      }
      RGB = RGB * V + U;
    }
    return RGB;
  }

Note the calculation of 'T'. Let's split that expression in two:

      float A = H + min(frac(2 * H) / 4, frac(-2 * H) / 8);
      float T = tan(A * PI * 2);

We can see that 'T' will tend to infinity when 'A' approaches 0.25 or 0.75. A bit of careful graphing of 'H' against 'A' suggests that this only occurs when the input hue approaches 1/6 or 2/3 respectively, so we can put extra checks in the sextant clauses:

      float A = (H + min(frac(2 * H) / 4, frac(-2 * H) / 8)) * PI * 2;
      float T;
      H *= 6;
      if (H <= 0.999)
      {
        T = tan(A);
        RGB.r = 1;
        RGB.g = T / (1 + T);
      }
      else if (H <= 1.001)
      {
        RGB.r = 1;
        RGB.g = 1;
      }
      else if (H <= 2)
      {
        T = tan(A);
        RGB.r = (1 + T) / T;
        RGB.g = 1;
      }
      else if (H <= 3)
      {
        T = tan(A);
        RGB.g = 1;
        RGB.b = 1 + T;
      }
      else if (H <= 3.999)
      {
        T = tan(A);
        RGB.g = 1 / (1 + T);
        RGB.b = 1;
      }
      else if (H <= 4.001)
      {
        RGB.g = 0;
        RGB.b = 1;
      }
      else if (H <= 5)
      {
        T = tan(A);
        RGB.r = -1 / T;
        RGB.b = 1;
      }
      else
      {
        T = tan(A);
        RGB.r = 1;
        RGB.b = -T;
      }

Of course, if you're confident that your platform won't throw too much of a wobbly when computing 'tan' of half pi et al, you can hoist the calculation of 'T' to its declaration before the 'if' clauses. You never know your luck: your shader compiler might to that for you!

Having said all that, the more I read about the HCL colour space, the less I'm convinced it's actually worthwhile.

Sunday, 21 October 2018

egg Syntax 2

As mentioned last time, I've been working on a poster for the complete egg programming language syntax as a railroad diagram. I finally managed to squeeze it on to a sheet of A3:


Of course,viewing it online as an SVG is a more pleasurable experience.

Over the next few weeks I aim to use this poster as the basis for an introduction to the egg programming language via its syntax.

Tuesday, 2 October 2018

egg Syntax 1

I've been working intensively on the syntax of the egg programming language. In particular, I've been looking at methods for teaching the syntax to learners not familiar with programming languages. But first, as ever, some background...

Backus-Naur Form

Backus-Naur Form (BNF) is a formal specification typically used to describe the syntax of computer languages. In its simplest form, it is a series of rules where each rule offers a choice or sequence of two or more further rules. Ironically, the syntax of BNF varies greatly, but I'll use the following:

<integer> ::= <zero> | <positive>
<positive::= <one-to-nine<opt-digits>
<opt-digits::= <digitsε
<digits::= <digit<opt-digits>
<digit::= <zero<one-to-nine>
<zero::= "0"
<one-to-nine> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

The epsilon "ε" represents a non-existent element.

The rules above define the syntax for (non-negative) integers. Informally,
  • An integer is either zero or a positive integer.
  • A positive integer is a digit "1" to "9" followed by zero or more digits "0" to "9".
These rules explicitly disallow sequences such as "007" being interpreted as "integers".

Formal BNF is great for computers to parse, but verbose and opaque for humans to read. The usual compromise is Extended Backus-Naur Form.

Extended Backus-Naur Form

Extended Backus-Naur Form (EBNF) adds a few more constructs to make rules more concise and (allegedly) easier to read:
  • Suffix operator "?" means "zero or one" of the preceding element or group;
  • Suffix operator "*" means "zero or more" of the preceding element or group;
  • Suffix operator "+" means "one or more" of the preceding element or group;
  • The need for the epsilon "ε" symbol can be removed; and
  • Parentheses are used to group elements
Our example syntax above could be re-written in EBNF as:

<integer::= <zero| <positive>
<positive::= <one-to-nine<digit>*
<digit::= <zero<one-to-nine>
<zero::= "0"
<one-to-nine> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

EBNF syntax rules are used a great deal in computer science, but, as can be seen in the "official" EBNF of EBNF (Section 8.1), it's still quite impenetrable for non-trivial cases.

Railroad Diagrams

Railroad diagrams are graphic representations of syntax rules. I first came across them when I learned Pascal and I believe they are one of the factors in making JSON so successful. As with their textual counterparts, railroad diagrams come in a number of flavours. One of my favourites is Gunther Rademacher's. Paste the following into the "Edit Grammar" tab of http://www.bottlecaps.de/rr/ui for an example:

integer ::= zero | positive
positive ::= one-to-nine digit*
digit ::= zero | one-to-nine
zero ::= "0"
one-to-nine ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

However, with existing railroad syntax diagrams, there's generally a one-to-one correspondence between rules and images. I wondered if there was a way to break this link.

Egg BNF Diagrams

I wrote a simple railroad diagram generator in JavaScript with the following features:

Rules

Rules are enclosed in pale blue boxes:
Terminal tokens are in purple rectangles. References to rules are in brown ovals. Tracks are green lines terminated by green circles.

Choices

Choices are stacked vertically:
Optional elements branch below the main line:

Loops

Loops appear above the main line. There are three main forms: zero or more, one or more and lists with separators:

Embedding

Rule definitions may be embedded as one of the occurrences within another rule:
Using these features, you can express our example syntax using individual Egg BNF Diagrams: 
Or you can embed the rules into a single diagram:
Personally, I find the last diagram gives me a fair indication of the overall structure of the syntax when compared to the stack of diagrams for individual rules.

This gave me the idea of a single poster diagram for the entire egg programming language syntax...

Wednesday, 19 September 2018

egg Virtual Machine 1

It's been nearly two months since my last post on egg. I did have a couple of breaks but the main delay was a substantial re-write due to a blind alley I stumbled down. It was my own fault...

I started the egg project with an informal language specification in my head and a desire to minimise the number of dependencies on external libraries. Consequently, I developed the language syntax and the parser hand-in-hand. This was the first bad idea.

The parser creates an abstract syntax tree (AST). The ability to create an AST does not guarantee that the program is "well-formed", "correct" or anything else; you need to be able to run the program to verify that what you've got what you expected. So I implemented an interpreter that "ran" the AST and developed it alongside the syntax specification and parser. This was my second bad idea.

The problems didn't rear their heads until I started implementing generators. If you remember, these are co-routines that generate sequences of values. At present, there is no standard way of writing portable co-routines in C++. However, there is an experimental proposal for C++20. One alternative is to write a special form of the egg interpreter to execute generator functions (and only generator functions) in a "continuable" manner. Another alternative is to automatically re-write the generator functions as finite state machines. The issue with the first alternative is that it greatly increases the testing burden because you now have two interpreters that must perform consistently. The issue with the second alternative is that it's just plain hard!

But, anyway, I started down the road of the automatic generator function re-writer and quickly came up against two major obstacles:
  1. What was I converting from?
  2. What was I converting to?
Using ASTs to represent chunks of executable code is just daft; the level of abstraction is all wrong. What I needed was a virtual machine "intermediate representation". Sigh.

This was when I realised my first bad idea was to design the egg language syntax alongside the parser. The developer is just too tempted to use existing portions of the parser to implement new language features instead of taking a step back and saying "What would make most sense for a learner (or practitioner) of this language?"

My second bad idea was to interpret the AST directly and this turned out to be far more serious. The implementation of the execution subsystem pollutes that of the parser and vice versa. I intend to re-write the parser in egg (dogfooding) as soon as possible, and extricating the execution-only portion in C++ would have been a nightmare.

The egg virtual machine is called "ovum" and is extremely high-level. This is a deliberate design decision enabling us to cross-compile egg modules to other computer languages and still be readable by humans. In fact, you can "compile" an egg module, throw away the source, "de-compile" it to egg and get back pretty much what you fed in, but nicely formatted and without the comments. Ovum doesn't even dictate whether the underlying "machine" is stack-based, register-based, or whatever.

The external representation is a naturally-compact byte code. The internal presentation is a tree, not of syntax, but of opcodes. The opcodes describe the semantics of the program; there are far more of them than keywords in the egg language because context is everything. Therefore, a task such as scanning the program for variable declarations is much more simple that via the AST.

I took the execution re-write as an opportunity to refactor some troubling aspects of the previous implementation caused by my two bad ideas. However, this means that the parser and the compiler now use a different type system to the run-time! But, as I mentioned, I'm hoping to throw away the parser and compiler in the near future and have them running on the ovum VM.

The extensive test suites I've been maintaining, including dozens of example scripts, mean I'm fairly confident that everything is still working as expected. I'm back to the stage where I need to implement generator functions, and at least now it's a VM I'm working with and not an AST! But I did lose a few weeks.

Here's an old meme to cheer us all up:

Oh, the hue manatee!

Saturday, 28 July 2018

egg Sequences 3

Sequences (as introduced earlier) are a good way to abstract lists of homogeneous data. We are only allowed to see one piece of data at a time and we don't know (until we get there) where the end of the data is. Indeed, sequences can be infinite.

This promotes the use of streaming algorithms which allow us to process very large data sets (i.e. ones that cannot all fit into memory at one time).

However, there is a hidden danger here:
Sequences may promote sequential processing of data
The hint's in the name!

Sequential processing on today's multi-core computers is rightly frowned upon. We should be writing code that is either explicitly parallel or gives the compiler and/or run-time the opportunity to parallelize. But sequential list processing, where you primarily only process the head of the list before moving on to the tail elements, goes all the way back to Lisp; over sixty years ago! It's no wonder these concepts are so deeply ingrained.

Sequences are often reduced via accumulation. As Guy Steel hints at in "Four Solutions to a Trivial Problem", accumulation can be difficult to automatically parallelize. The abandoned Fortress project was an attempt to mitigate this.

The egg language (not in the same league!) is designed to be easy to learn, but I would like some aspects of concurrency and parallelization to be accessible, even if only as a taste of more complex paradigms. Microsoft's PLINQ demonstrates that it is possible to build parallel execution on top of sequences, but I need to do a lot more research in this area. In particular, do I need to worry about this as part of the language specification as opposed to a library, like PLINQ?