Sunday, 9 February 2020

egg Try Operator

I had a bit of an epiphany earlier in the year. Not one of "Archimedean" proportions (no bathwater was at risk), but I seem to remember audibly muttering "Ooh" at the time. It concerned egg guarded assignment statements.

In many programs, it's common to attempt a data access that you know will fail quite often. For example, looking up an entry in a cache. Programming languages and their libraries provide a number of patterns for doing this:

  // C++
  auto found = cache.find(key);
  if (found != cache.end()) {
    CacheHit(*found);
  } else {
    CacheMiss();
  }

or

  // C#
  if (cache.TryGet(key, out var found)) {
    CacheHit(found);
  } else {
    CacheMiss();
  }

or

  // JavaScript
  if (key in cache) {
    CacheHit(cache[key]);
  } else {
    CacheMiss();
  }

or

  // Java
  if (cache.containsKey(key)) {
    CacheHit(cache.get(key));
  } else {
    CacheMiss();
  }

The C# paradigm (a.k.a. "try-get") is arguably the most elegant (it's atomic and concise), but it leads to a bloated interface for the containers:

  // C#
  interface ICache<K,V> : ...
  {
    bool TryGet(K key, out V value);
    V Get(K key);
    ...
  }

In egg, these interface members look like the following functions:

  // egg
  type<K,V> bool TryGet(K key, V* value);
  type<K,V> V Get(K key);

Both these members perform essentially the same task, but the "TryGet()" method will return "false" if the key doesn't exist, whereas "Get()" will throw an exception.

If we don't supply the "TryGet()" member, we need to use the following paradigm to emulate it:

  // egg
  bool found;
  V value;
  try {
    value = cache.Get(key);
    found = true;
  } catch (Exception e) {
    found = false;
  }

Whereas if we only supply "TryGet()", we need to use the following to emulate "Get()":

  // egg
  V value;
  if (!cache.TryGet(key, &value)) {
    throw ExceptionKeyNotFound();
  }

Like C# 7 pattern matching, the egg language has the concept of guarded assignments:

  if (type variable = expression) {
    assignment-successful
  } else {
    assignment-failed
  }

But this doesn't handle the case where the evaluation of "expression" raises an exception: you'd still need to wrap the whole thing up in a "try-catch" block.

My original idea was to allow functions to return a "void" value which can never be successfully assigned to a variable. You could then use the following pattern:

  // egg
  type<K,V> V|void Fetch(K key);
  ...
  if (var value = cache.Fetch(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

This is good for guarded assignments, but there's no informative exception payload for standard assignments or other evaluations. The following:

  // egg
  var value = cache.Fetch(key);

will just report an unhelpful "unable to assign 'void' value" exception at run-time.

My almost-"Eureka!" moment came when I considered adding a "try" operator to the language:

  type variable = try expression

The new operator is prefix unary and attempts to evaluate its operand at run-time. If it succeeds, the result is the value of the expression. If an exception is raised during the evaluation of the expression, the result of the operator is "void".

In general, this operator only makes sense inside a guarded assignment:

  // egg
  type<K,V> V Get(K key);
  ...
  if (var value = try cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }
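egg isn't runnable here, but the behaviour of the operator can be sketched in Python (a sketch: "try_" is my stand-in for the operator, with "None" playing the role of egg's "void"):

```python
def try_(thunk):
    # Emulate the proposed "try" operator: evaluate the operand; if it
    # raises, the result is None (standing in for egg's "void").
    try:
        return thunk()
    except Exception:
        return None

cache = {"a": 1}

# Guarded-assignment style, via the walrus operator:
if (value := try_(lambda: cache["b"])) is not None:
    print("hit", value)
else:
    print("miss")  # prints "miss": "b" is not in the cache
```

One wrinkle of the sketch: a cached value of None is indistinguishable from a miss, which egg's distinct "void" avoids.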

or, possibly, in an "if-try" condition:

  // egg
  if (try PerformSomething()) {
    Success();
  } else {
    Failure();
  }

In the latter case, the "try" operator takes an operand that evaluates to "void" and returns "false" if an exception is thrown and "true" otherwise. There's little benefit to this form over the standard "try-catch" syntax, and it hides the cause of the exception, so I'm less inclined to support it.

I haven't decided on the precise syntax for "guarded try assignments" (or indeed a better name!) but here are a few options:

Option 1 (currently preferred)

Place the "try" keyword immediately after the assignment operator:

  if (var value = try cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 2

Place the "try" keyword immediately before the variable type:

  if (try var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 3

Place the "try" keyword immediately after the "if" keyword:

  if try (var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 4

Replace the "if" keyword with "try":

  try (var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 5

Replace the "if" keyword with "try" and add an optional "catch" clause:

  try (var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    AssignmentFailed();
  } catch (Exception e) {
    ExceptionRaised();
  }

Option 5 shows promise because you have access to the precise exception that was raised in the evaluation of the expression, but it's not immediately obvious which failure clause gets executed under all circumstances.

As I said, it wasn't a full-blown "Eureka!" moment, but it looks promising. It is a deviation from the standard curly-brace syntax; but then, so is guarded assignment.

Sunday, 19 January 2020

Escapeless, Restartable, Binary Encoding

Imagine you are broadcasting a message over the air by Morse code. The message (or plain text) is made up of words of capital letters separated by spaces:

THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG

But, for some unexplained reason, you cannot transmit spaces – only the twenty-six letters. How would you come up with a protocol for transmitting messages such that a receiver can unambiguously reconstruct the plain text?

The obvious first stab is to transmit the message without spaces and hope the receiver can guess where the spaces should be:

THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG

This is fine for relatively intelligent humans, but fraught with problems for machines. And it's far from "unambiguous":

ADAPTORBIT
BANALLIES
TEACHART
UNCLEARRAY
UNDONEGATE

Another solution is to designate one letter as an escape character and append a discriminator after each occurrence. For instance, we could use "Z" as the escape character and use "ZS" to signify a space and "ZZ" to signify a single "Z" in the original message:

THEZSQUICKZSBROWNZSFOXZSJUMPSZSOVERZSTHEZSLAZZYZSDOG

Escaping is common in programming language source files with a finite character set: for example, "\" is utilised in string constants in many "curly brace" languages. Generally, one picks an "unlikely" character as the escape.
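The scheme round-trips mechanically; here's a Python sketch (the function names are mine):

```python
def z_encode(text):
    # "Z" is the escape: "ZS" encodes a space, "ZZ" a literal "Z"
    out = []
    for ch in text:
        if ch == " ":
            out.append("ZS")
        elif ch == "Z":
            out.append("ZZ")
        else:
            out.append(ch)
    return "".join(out)

def z_decode(coded):
    out, i = [], 0
    while i < len(coded):
        if coded[i] == "Z":
            out.append(" " if coded[i + 1] == "S" else "Z")
            i += 2
        else:
            out.append(coded[i])
            i += 1
    return "".join(out)

print(z_encode("THE LAZY DOG"))  # prints "THEZSLAZZYZSDOG"
```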

There are two problems with escaping in this manner:
  1. The length of the encoded message depends on the number of occurrences of the escape character in the plain text. A pathological case is where the plain text consists solely of repeats of the escape character (e.g. "ZZZZZZZZ"); this results in an encoding that is twice the size of the original.
  2. If part of the message is lost or corrupted, it is very difficult to reconstruct (or re-synchronise) the plain text, particularly when an escape sequence is "cracked" (e.g. "...SBROWNZ...").

We can address Point 2 by using a more sophisticated encoding scheme that is "restartable". This means that if part of the message is corrupt or missing we can, at least, re-synchronise with the data stream at a later point. An example of a restartable stream is UTF-8: you can always distinguish the start of a new code point from a continuation byte by looking at its top two bits. However, this doesn't solve the problem of variable encoding size and pathological bloat.
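UTF-8's self-synchronisation is easy to see in a few lines of Python (a sketch; "resync" is my name for the skip-continuation-bytes step):

```python
def resync(data):
    # Skip UTF-8 continuation bytes (top two bits are 10) until the next
    # lead byte - the property that makes UTF-8 restartable.
    i = 0
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

damaged = "héllo".encode("utf-8")[2:]  # stream cut mid-code-point
tail = damaged[resync(damaged):]
print(tail.decode("utf-8"))            # prints "llo"
```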

In encoding theory, this problem is characterised as encoding data in one alphabet into another smaller alphabet. In the case above, we're trying to encode a message in a 27-code alphabet into a 26-code one.

In real life, it's quite common to want to transmit arbitrary 8-bit binary data over a one-way channel as a series of distinct messages. If the communications channel isn't perfect, we probably want to be able to re-synchronise to the start of the next message after a failure. If the channel is indeed one-way, we cannot ask the sender to re-transmit; so it is up to the receiver to recover as best they can.

If we use a unique sentinel between the messages, this process is analogous to trying to encode a 257-code alphabet (256 byte values + 1 sentinel) using a 256-code alphabet (a byte stream).

This looks like a tall order and if you naively try to convert an arbitrarily-long, base-257 number to base-256 you quickly run out of steam. However, a nice trick outlined by Ivan Kosarev leads to a breakthrough.

Supposing we separate the binary messages in their encoded form using a single zero byte: 0x00. We must be careful not to use 0x00 within the encoded message otherwise we wouldn't be able to re-synchronise to the start of the next message after a failure. We could choose a non-zero byte value as a substitute for 0x00 in the original message and replace all occurrences of zero with the zero-substitute value. However, how do we choose the zero-substitute value in such a way that it is guaranteed not to occur in the original message?

Consider dividing the original messages up into 254-byte chunks. Within each chunk, there is guaranteed to be at least one non-zero byte value that does not occur. (We could not use 255-byte chunks because of the edge case where all byte values 0x01 to 0xFF occur, leaving no non-zero value for the zero-substitute.) For each chunk, pick the highest byte value that does not occur as the zero-substitute for that chunk.
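The pigeonhole argument makes the substitute selection trivial; a Python sketch (the function name is mine):

```python
def pick_substitute(chunk):
    # A chunk holds at most 254 bytes, hence at most 254 distinct values,
    # so at least one of the 255 non-zero values 0x01..0xFF must be absent.
    present = set(chunk)
    return max(v for v in range(1, 256) if v not in present)

print(hex(pick_substitute(bytes(range(1, 255)))))  # prints "0xff"
```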

Our encoding algorithm then becomes:

for each message:
  split message into chunks of 254 bytes (last may be smaller)
  for each chunk:
    substitute := greatest byte value not present within chunk
    output substitute
    for each byte in chunk:
      if byte == 0x00:
        output substitute
      else:
        output byte
  output 0x00

This isn't quite a true streaming algorithm because you've got to collect 254-byte chunks to find the appropriate zero-substitute. But it's close.
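A direct Python transcription of the encoder (a sketch; names are mine):

```python
def encode(msg):
    out = bytearray()
    for i in range(0, len(msg), 254):
        chunk = msg[i:i + 254]
        # greatest byte value not present in the chunk (the pigeonhole
        # argument above guarantees a non-zero one exists)
        sub = max(v for v in range(1, 256) if v not in set(chunk))
        out.append(sub)
        out.extend(sub if b == 0x00 else b for b in chunk)
    out.append(0x00)  # message separator
    return bytes(out)

print(encode(b"\x01\x00\x02").hex())  # prints "ff01ff0200"
```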

Decoding requires less bookkeeping:

message := empty byte buffer
count := 0
for each byte in input stream:
  if byte == 0x00:
    output message
    reset message to empty byte buffer
    count := 0
  else if count == 0:
    substitute := byte
    count := 254
  else:
    count := count - 1
    if byte == substitute:
      append 0x00 to message
    else:
      append byte to message

Both the encoder and decoder can be implemented on machines with very limited resources. Neither needs to know anything about the format of the messages they are processing.

If a failure is detected, you can quickly fast-forward the data stream to the beginning of the next message by searching for a 0x00 byte. Note that we assume that some robust checksum is encoded into the messages to detect corruption with high probability.

Crucially, the relationship between the size of the plain messages and their encoded forms does not rely on the content of the messages:

encoded_size == floor((plain_size * 255 + 253) / 254) + 1

The "encoded_size" includes the zero byte separator between messages. For large messages, the data overhead (excluding the separator) tends towards less than 0.4%. For small messages (less than the chunk size), the overhead is never more than one byte.

An interesting variant is to XOR the current substitute value with all bytes so as to obfuscate the payload:

for each message:
  split message into chunks of 254 bytes (last may be smaller)
  for each chunk:
    substitute := greatest byte value not present within chunk
    output substitute
    for each byte in chunk:
      output (byte xor substitute)
  output 0x00

Decoding with obfuscation:

message := empty byte buffer
count := 0
for each byte in input stream:
  if byte == 0x00:
    output message
    reset message to empty byte buffer
    count := 0
  else if count == 0:
    substitute := byte
    count := 254
  else:
    count := count - 1
    append (byte xor substitute) to message
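A Python round trip of the obfuscating variant (a sketch; names are mine). No zero-substitution step is needed: a data byte XORed with the substitute can never be 0x00, because the substitute is absent from the chunk:

```python
def encode_obf(msg):
    out = bytearray()
    for i in range(0, len(msg), 254):
        chunk = msg[i:i + 254]
        # greatest byte value absent from the chunk
        sub = max(v for v in range(1, 256) if v not in set(chunk))
        out.append(sub)
        out.extend(b ^ sub for b in chunk)  # never yields 0x00
    out.append(0x00)
    return bytes(out)

def decode_obf(stream):
    msgs, msg = [], bytearray()
    count = sub = 0
    for b in stream:
        if b == 0x00:
            msgs.append(bytes(msg))
            msg, count = bytearray(), 0
        elif count == 0:
            sub, count = b, 254
        else:
            count -= 1
            msg.append(b ^ sub)  # original zeros come back as sub ^ sub
    return msgs

messages = [b"hello\x00world", bytes(300), b""]
stream = b"".join(encode_obf(m) for m in messages)
assert decode_obf(stream) == messages
```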

Tuesday, 22 October 2019

An Alphabet for the Digital Age

Remember those glossy posters at school to help you learn the alphabet? I'm assuming they still use them. They probably need updating from this...

[Image: a Victorian alphabet poster. Source: https://doverhistorian.com/2013/11/19/ragged-school/]

Perhaps the following?

[Image: a proposed alphabet for the digital age]
I'm a bit disappointed with the entries for "Q" and "X", and "Y" is a bit obscure. I think I'll start lobbying the Unicode Consortium and get some strategically-named code points into the next release.

Proofreader T-Shirt

Founder Member of the the National Society of Proofreaders

Thursday, 31 January 2019

egg Syntax 3

As promised, I've created a guide to the egg programming language based around its syntax.

Even though it doesn't go into the nitty-gritty of aspects such as type schema, built-in functions or attributes, it's still nearly thirty pages long and took considerably longer than expected to put together. Most of the space is taken up by pretty railroad diagrams and example code. But some of the delay was due to the realisation that my thinking about the type system was somewhat muddled. I've tinkered with the EBNF quite a bit in the last few weeks and now feel a lot more comfortable with it.

Short email exchanges with Profs Barbara Liskov and Niklaus Wirth convinced me that the type system should only be prevalent at the interfaces between modules. Within modules it shouldn't get in the way. I've always thought that you spend far too long in languages like C++ fussing over the concrete types (classes) instead of the functionality (functions).

Some computer languages alleviate these pressures by having sophisticated type inference, but that can cause headaches for learners (and non-learners!) who can't grok the inference rules, which are usually fiendish. For example, consider:

  int c = 10;
  var fn = (a, b) => a + b + c;

To a human reader, "fn" is obviously a function that adds its two parameters to "c". But what are the types of its parameters and its return value? Integers? Floats? Strings? A mixture? The compiler could deduce additional information by looking at the later usages of "fn" but this would make the inference of its type "non-local" and therefore potentially confusing.

The egg language has "function expressions" which are like strongly-typed "lambda expressions" (e.g. C++ lambdas):

  int c = 10;
  var fn = float(int a, float b) { return a + b + c; };

These are unambiguous, but are a little clumsy. So egg accepts lambda expressions (which it converts to function expressions) providing that the types can be trivially inferred. For example:

  type Adder = float(int, float);
  int c = 10;
  Adder fn = (a, b) => a + b + c;

This isn't much of a "win" in the example above, but if the type is inferred by matching function parameters, it leads to more fluent syntax:

  float process(float(int, float) fn) {
    ...
  }
  int c = 10;
  var result = process((a, b) => a + b + c);

When inferring function types this way, "trivial" means that only the arity (i.e. the number of parameters) of the function is considered. This is further simplified by the restriction that, in egg, lambdas cannot be variadic; only true functions can be. For example:

  type Fn0 = void();
  type Fn1 = void(string);
  type Fn3 = void(int,string,int);
  void process(Fn0|Fn1|Fn3 fn, string s, int a, int b) {
    if (Fn0 fn0 = fn) {
      fn0();
    } else if (Fn1 fn1 = fn) {
      fn1(s);
    } else if (Fn3 fn3 = fn) {
      fn3(a, s, b);
    } else {
      throw "internal error";
    }
  }
  process((a, s, b) => { print(a, s, b, "\n"); }, "+", 1, 2); // prints "1+2"

This is quite a sophisticated dispatch mechanism for a language that doesn't permit function overloading. It allows library implementers to support algorithmic optimisation based on lambda signatures without increasing the cognitive load for people not interested in this level of detail.
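The dispatch can be mimicked in Python, where inspect.signature exposes a callable's arity at run-time (a sketch: Python checks only parameter counts, not the parameter types that egg's union type would also enforce):

```python
import inspect

def process(fn, s, a, b):
    # dispatch on the callable's arity, mirroring egg's Fn0|Fn1|Fn3 union
    arity = len(inspect.signature(fn).parameters)
    if arity == 0:
        return fn()
    if arity == 1:
        return fn(s)
    if arity == 3:
        return fn(a, s, b)
    raise TypeError("internal error")

print(process(lambda a, s, b: f"{a}{s}{b}", "+", 1, 2))  # prints "1+2"
```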

Oh, I've also updated the A3 poster.

Friday, 11 January 2019

A Crowded Chronology of Cities

For a change of pace, I've been working on a web page to illustrate the most populous urban areas throughout history: A Crowded Chronology of Cities



It's interesting that the graph of the population of the largest city isn't monotonic. As far as I can tell, there were two main "reduction" events:
  • The fall of the Roman Empire
  • The fourteenth century plagues (that's my guess, anyway!)

Sunday, 4 November 2018

RGB/HSV in HLSL 8

Andrew (K-Be) emailed me this week with a problem he found in my HLSL code for HCL-to-RGB conversion. Here's the original code:

  float HCLgamma = 3;
  float HCLy0 = 100;
  float HCLmaxL = 0.530454533953517; // == exp(HCLgamma / HCLy0) - 0.5
  float PI = 3.1415926536;
 
  float3 HCLtoRGB(in float3 HCL)
  {
    float3 RGB = 0;
    if (HCL.z != 0)
    {
      float H = HCL.x;
      float C = HCL.y;
      float L = HCL.z * HCLmaxL;
      float Q = exp((1 - C / (2 * L)) * (HCLgamma / HCLy0));
      float U = (2 * L - C) / (2 * Q - 1);
      float V = C / Q;
      // PROBLEM HERE...
      float T = tan((H + min(frac(2 * H) / 4, frac(-2 * H) / 8)) * PI * 2);
      H *= 6;
      if (H <= 1)
      {
        RGB.r = 1;
        RGB.g = T / (1 + T);
      }
      else if (H <= 2)
      {
        RGB.r = (1 + T) / T;
        RGB.g = 1;
      }
      else if (H <= 3)
      {
        RGB.g = 1;
        RGB.b = 1 + T;
      }
      else if (H <= 4)
      {
        RGB.g = 1 / (1 + T);
        RGB.b = 1;
      }
      else if (H <= 5)
      {
        RGB.r = -1 / T;
        RGB.b = 1;
      }
      else
      {
        RGB.r = 1;
        RGB.b = -T;
      }
      RGB = RGB * V + U;
    }
    return RGB;
  }

Note the calculation of 'T'. Let's split that expression in two:

      float A = H + min(frac(2 * H) / 4, frac(-2 * H) / 8);
      float T = tan(A * PI * 2);

We can see that 'T' will tend to infinity when 'A' approaches 0.25 or 0.75. A bit of careful graphing of 'H' against 'A' suggests that this only occurs when the input hue approaches 1/6 or 2/3 respectively, so we can put extra checks in the sextant clauses:

      float A = (H + min(frac(2 * H) / 4, frac(-2 * H) / 8)) * PI * 2;
      float T;
      H *= 6;
      if (H <= 0.999)
      {
        T = tan(A);
        RGB.r = 1;
        RGB.g = T / (1 + T);
      }
      else if (H <= 1.001)
      {
        RGB.r = 1;
        RGB.g = 1;
      }
      else if (H <= 2)
      {
        T = tan(A);
        RGB.r = (1 + T) / T;
        RGB.g = 1;
      }
      else if (H <= 3)
      {
        T = tan(A);
        RGB.g = 1;
        RGB.b = 1 + T;
      }
      else if (H <= 3.999)
      {
        T = tan(A);
        RGB.g = 1 / (1 + T);
        RGB.b = 1;
      }
      else if (H <= 4.001)
      {
        RGB.g = 0;
        RGB.b = 1;
      }
      else if (H <= 5)
      {
        T = tan(A);
        RGB.r = -1 / T;
        RGB.b = 1;
      }
      else
      {
        T = tan(A);
        RGB.r = 1;
        RGB.b = -T;
      }
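The graphing claim earlier, that 'A' only reaches 0.25 and 0.75 at hues 1/6 and 2/3, is easy to spot-check in Python ('frac' here mimics the HLSL intrinsic):

```python
import math

def frac(x):
    # HLSL frac(): fractional part, always in [0, 1)
    return x - math.floor(x)

def A(H):
    return H + min(frac(2 * H) / 4, frac(-2 * H) / 8)

# tan(A * 2 * pi) is singular where A is 0.25 or 0.75...
assert abs(A(1 / 6) - 0.25) < 1e-9
assert abs(A(2 / 3) - 0.75) < 1e-9
# ...and safely finite a little way away from those hues
assert abs(math.tan(A(0.999 / 6) * 2 * math.pi)) < 1e5
```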

Of course, if you're confident that your platform won't throw too much of a wobbly when computing 'tan' of half pi et al., you can hoist the calculation of 'T' to its declaration before the 'if' clauses. You never know your luck: your shader compiler might do that for you!

Having said all that, the more I read about the HCL colour space, the less I'm convinced it's actually worthwhile.