Sunday 29 November 2020

egg Pointer Ambiguity

I've been beavering away on the egg programming language for the last few months. One of the major stumbling blocks has been the type system. As Hisham Muhammad points out, type systems are almost always more complex they first appear to be. I've had to re-implement the compiler and virtual machine several times because of fundamental mistakes I've made with the egg type system, even though it's supposedly "simple."

Here's an example of one problem I'm still struggling with right now.

I'd like to have "safe pointers" in the egg language to handle concepts such as pass-by-reference:

bool safeDivide(float num, float den, float* out) {
  if (den == 0) {
    return false;
  }
  *out = num / dev;
  return true;
}

This all looks hunky-dory, but now consider this:

any v = 123; // line 1
int* p = &v; // line 2
v = "hello"; // line 3

In line 1, we define a variable 'v' that can store most types of value and initialize it with an integer value. In line 2, we define a pointer variable 'p' and point it at 'v'. In line 3, we modify 'v' to be a string. The question is: "What is the value of '*p' after line 3?"

The type declaration of 'p' suggests that '*p' should (always) be an integer, but it's now pointing to a string. There's obviously something "wrong" here, but what exactly is it?

Option A

Line 2 should have produced a compile-time error along the lines of
Cannot initialize a pointer to a value of type 'int' with the address of a value of type 'any'

That is, we only allow pointers to point to values of exactly the appropriate type. This requires that the type of any operand of the address-of '&' operator is known precisely at compile-time.

Option B

Line 3 should have produced a compile-time error because the assignment invalidates the type constraint of 'p'. This requires us to perform some very sophisticated static analysis; I'm not even sure it's possible beyond trivial examples.

Option C

Line 3 produces a runtime error because the assignment invalidates the type constraint of 'p'. This requires us to keep track of all pointers pointing to a value and checking for invalidation on every assignment. This "observer" scheme sounds very expensive to me.

Option D

Produce a runtime error if or when 'p' is subsequently dereferenced and it is discovered that it no longer points to an integer. This would mean that the error is raised "at a distance" from the assignment that caused the issue, thereby making debugging more difficult.

Option E

We make egg pointers "typeless" or, put it another way, all pointers must be of type 'any?*'. This means that if the pointee changes type, we don't really care.

Option F

Nothing is wrong! Just live with the fact that '*p' isn't necessarily an integer, even though it's defined like that.

It all comes down to how the runtime type of the pointee and the compile-time declaration of the pointer interact. I cannot imagine I'm the first to have come across this issue, but a quick search of literature hasn't come up with anything. But then, I don't know what the problem is called, so I'm stumbling in the dark somewhat.

My current "least-hated solution" is a hybrid of Options A and D: try to detect inconsistencies at compile-time but fall back to checking at runtime whenever the pointer is dereferenced.

Sunday 26 July 2020

Colour Names 1

It's common knowledge that web colour names are derived from the X11 colour names. But the relationship is in reality a series of compromises. Here's a comparison list:

ColourX11 Name(s)Web Name(s)RGB
    Alice Bluealiceblue#F0F8FF
    Antique Whiteantiquewhite#FAEBD7
    Aqua
Cyan
aqua
cyan
#00FFFFNote 1
    Aquamarineaquamarine#7FFFD4
    Azureazure#F0FFFF
    Beigebeige#F5F5DC
    Bisquebisque#FFE4C4
    Blackblack#000000
    Blanched Almondblanchedalmond#FFEBCD
    Blueblue#0000FF
    Blue Violetblueviolet#8A2BE2
    Brownbrown#A52A2A
    Burlywoodburlywood#DEB887
    Cadet Bluecadetblue#5F9EA0
    Chartreusechartreuse#7FFF00
    Chocolatechocolate#D2691E
    Coralcoral#FF7F50
    Cornflower Bluecornflowerblue#6495ED
    Cornsilkcornsilk#FFF8DC
    Crimsoncrimson#DC143C
    Dark Bluedarkblue#00008B
    Dark Cyandarkcyan#008B8B
    Dark Goldenroddarkgoldenrod#B8860B
    Dark Gray
Dark Grey
darkgray
darkgrey
#A9A9A9Note 2
    Dark Greendarkgreen#006400
    Dark Khakidarkkhaki#BDB76B
    Dark Magentadarkmagenta#8B008B
    Dark Olive Greendarkolivegreen#556B2F
    Dark Orangedarkorange#FF8C00
    Dark Orchiddarkorchid#9932CC
    Dark Reddarkred#8B0000
    Dark Salmondarksalmon#E9967A
    Dark Sea Greendarkseagreen#8FBC8F
    Dark Slate Bluedarkslateblue#483D8B
    Dark Slate Gray
Dark Slate Grey
darkslategray
darkslategrey
#2F4F4FNote 2
    Dark Turquoisedarkturquoise#00CED1
    Dark Violetdarkviolet#9400D3
    Deep Pinkdeeppink#FF1493
    Deep Sky Bluedeepskyblue#00BFFF
    Dim Gray
Dim Grey
dimgray
dimgrey
#696969Note 2
    Dodger Bluedodgerblue#1E90FF
    Firebrickfirebrick#B22222
    Floral Whitefloralwhite#FFFAF0
    Forest Greenforestgreen#228B22
    Fuchsia
Magenta
fuchsia
magenta
#FF00FFNote 3
    Gainsborogainsboro#DCDCDC
    Ghost Whiteghostwhite#F8F8FF
    Goldgold#FFD700
    Goldenrodgoldenrod#DAA520
    Gray
Grey
X11 Gray
X11 Grey
n/a#BEBEBENote 4
    Green Yellowgreenyellow#ADFF2F
    Honeydewhoneydew#F0FFF0
    Hot Pinkhotpink#FF69B4
    Indian Redindianred#CD5C5C
    Indigoindigo#4B0082
    Ivoryivory#FFFFF0
    Khakikhaki#F0E68C
    Lavenderlavender#E6E6FA
    Lavender Blushlavenderblush#FFF0F5
    Lawn Greenlawngreen#7CFC00
    Lemon Chiffonlemonchiffon#FFFACD
    Light Bluelightblue#ADD8E6
    Light Corallightcoral#F08080
    Light Cyanlightcyan#E0FFFF
    Light Goldenrodlightgoldenrod#FFEC8BNote 5
    Light Goldenrod Yellowlightgoldenrodyellow#FAFAD2Note 5
    Light Gray
Light Grey
lightgray
lightrey
#D3D3D3
    Light Greenlightgreen#90EE90
    Light Pinklightpink#FFB6C1
    Light Salmonlightsalmon#FFA07A
    Light Sea Greenlightseagreen#20B2AA
    Light Sky Bluelightskyblue#87CEFA
    Light Slate Gray
Light Slate Grey
lightslategray
lightslategrey
#778899
    Light Steel Bluelightsteelblue#B0C4DE
    Light Yellowlightyellow#FFFFE0
    Lime Greenlimegreen#32CD32
    Lime
Green
X11 Green
lime#00FF00Note 6
    Linenlinen#FAF0E6
    Maroon
X11 Maroon
n/a#B03060Note 7
    Medium Aquamarinemediumaquamarine#66CDAA
    Medium Bluemediumblue#0000CD
    Medium Orchidmediumorchid#BA55D3
    Medium Purplemediumpurple#9370DB
    Medium Sea Greenmediumseagreen#3CB371
    Medium Slate Bluemediumslateblue#7B68EE
    Medium Spring Greenmediumspringgreen#00FA9A
    Medium Turquoisemediumturquoise#48D1CC
    Medium Violet Redmediumvioletred#C71585
    Midnight Bluemidnightblue#191970
    Mint Creammintcream#F5FFFA
    Misty Rosemistyrose#FFE4E1
    Moccasinmoccasin#FFE4B5
    Navajo Whitenavajowhite#FFDEAD
    Navy
Navy Blue
navy#000080Note 8
    Old Laceoldlace#FDF5E6
    Oliveolive#808000
    Olive Drabolivedrab#6B8E23
    Orangeorange#FFA500
    Orange Redorangered#FF4500
    Orchidorchid#DA70D6
    Pale Goldenrodpalegoldenrod#EEE8AA
    Pale Greenpalegreen#98FB98
    Pale Turquoisepaleturquoise#AFEEEE
    Pale Violet Redpalevioletred#DB7093
    Papaya Whippapayawhip#FFEFD5
    Peach Puffpeachpuff#FFDAB9
    Peruperu#CD853F
    Pinkpink#FFC0CB
    Plumplum#DDA0DD
    Powder Bluepowderblue#B0E0E6
    Purple
X11 Purple
n/a#A020F0Note 9
    Rebecca Purplerebeccapurple#663399Note 10
    Redred#FF0000
    Rosy Brownrosybrown#BC8F8F
    Royal Blueroyalblue#4169E1
    Saddle Brownsaddlebrown#8B4513
    Salmonsalmon#FA8072
    Sandy Brownsandybrown#F4A460
    Sea Greenseagreen#2E8B57
    Seashellseashell#FFF5EE
    Siennasienna#A0522D
    Silversilver#C0C0C0
    Sky Blueskyblue#87CEEB
    Slate Blueslateblue#6A5ACD
    Slate Gray
Slate Grey
slategray
slategrey
#708090Note 2
    Snowsnow#FFFAFA
    Spring Greenspringgreen#00FF7F
    Steel Bluesteelblue#4682B4
    Tantan#D2B48C
    Tealteal#008080
    Thistlethistle#D8BFD8
    Tomatotomato#FF6347
    Turquoiseturquoise#40E0D0
    Violetviolet#EE82EE
    Web Gray
Web Grey
gray
grey
#808080Note 4
    Web Greengreen#008000Note 6
    Web Maroonmaroon#800000Note 7
    Web Purplepurple#800080Note 9
    Wheatwheat#F5DEB3
    Whitewhite#FFFFFF
    White Smokewhitesmoke#F5F5F5
    Yellowyellow#FFFF00
    Yellow Greenyellowgreen#9ACD32

Notes:
  1. "Aqua" and "Cyan" are synonyms and refer to the same RGB colours.
  2. In X11 and hence web names, "Gray" (en-us) and "Grey" (en-gb) are synonyms.
  3. "Fuchsia" and "Magenta" are synonyms and refer to the same RGB colours.
  4. There is no equivalent of X11 "Gray/Grey" in the web colours. The closest is named "Web Gray/Gray" in X11 and maps to "gray/grey" in the web namespace.
  5. "Light Goldenrod" and "Light Goldenrod Yellow" are two distinct colours in both namespaces. Many online lists conflate the two.
  6. "Green" is different in the X11 (#00FF00) and Web (#008000) namespaces.
  7. "Maroon" is different in the X11 (#B03060) and Web (#800000) namespaces.
  8. "Navy/Navy Blue" can only be officially named "navy" in the Web namespace.
  9. "Purple" is different in the X11 (#A020F0) and Web (#800080) namespaces.
  10. "Rebecca Purple" was added to X11 and CSS 4.1.
If we exclude the three X11 colours ("Gray/Grey", "Maroon" and "Purple") that have no web equivalent, we are left with the 140 unique web colours.

Friday 3 July 2020

ZX81 Characters in Unicode 13.0

Almost exactly ten years ago, in an earlier post, I bemoaned the fact that a few glyphs from the ZX81 character set were missing from the Unicode standard. However, whilst researching seven-segment displays, I came across a reference to part of Unicode 13.0 named "Symbols for Legacy Computing". This obviously piqued my interest.



Within the upcoming release of the Unicode standard are:
214 graphic characters that provide compatibility with various home computers from the mid-1970s to the mid-1980s and with early teletext broadcasting standards.
Lo and behold, characters U+1FBF0 to U+1FBF9 are the ten seven-segment digits that first brought my attention to this block, whilst four other codepoints provide the missing ZX81 glyphs:

 ZX81Unicode Description 
 0x09 U+1FB8F LOWER HALF MEDIUM SHADE
 0x0A U+1FB8E UPPER HALF MEDIUM SHADE
 0x89 U+1FB91 UPPER HALF BLOCK AND LOWER HALF INVERSE MEDIUM SHADE
 0x8A U+1FB92 UPPER HALF INVERSE MEDIUM SHADE AND LOWER HALF BLOCK

Someone has even updated the Wikipedia page. Marvellous.

The ZX81 was launched in March 1981, but its character set was evolved from that used in the ZX80 which launched in January 1980. Then, in March 2020, Unicode was extended with characters that finally filled in the blanks. Forty years later? Crikey!

By the way, the eagle-eyed may have noticed that U+1FB94 ("LEFT HALF INVERSE MEDIUM SHADE AND RIGHT HALF BLOCK") does not have a mirrored twin ("LEFT HALF BLOCK AND RIGHT HALF INVERSE MEDIUM SHADE") which looks like it should fit at U+1FB93 (currently flagged "<reserved>"). However, the fascinating Working Group Proposal explains that
One code point, U+1FB93, was left unassigned in this proposal as a placeholder for the as-yet unattested *LEFT HALF BLOCK AND RIGHT HALF INVERSE MEDIUM SHADE, which would be the reverse-video equivalent of U+1FB8D RIGHT HALF MEDIUM SHADE from the Aquarius.
After briefly investigating the Aquarius character set, I feel Unicode 13.1 may be imminent.

Arm Clock

Seven-segment displays have long been used in digital clocks as they are a natural fit for LCD and LED technologies. There are even electro-mechanical displays for larger applications. However, for some reason I wondered if there was an alternative to flipping each segment in a binary fashion.

The ten digits are typically rendered in the following fashion:


If we want to produce these shapes using articulated arms (with each black segment being a "bone" and each grey dot being a "joint") we quickly see two constraints:
  1. There must be at least seven bones.
  2. The "root" of any arm must be located at one of the three right-hand pivot points.
If we further limit ourselves to the minimum number of bones (seven) it becomes apparent that we need at least two arms: try to manipulate a single seven-bone arm to make the "8" digit without breaking it.

I chose to use two arms: one rooted in the top-right corner with four bones (red), the other rooted at the middle-right pivot with three bones (blue):


By manipulating the angles between the bones, these two arms can fold themselves into any of the shapes making up the ten digits above:


If we wish to physically build this arrangement, there are two main problems to overcome:
  1. Bone are sometimes coincident (the joints fold back on themselves 180 degrees)
  2. The joint motors (or linkages) have to somehow fit in the arm bones
But these are merely robotics issues.

So we can theoretically create the shapes of digits to form a clock. But can we create some feasible transitions to animate it? If we assume a twenty-four hour clock, it turns out there are thirteen transitions we need to concern ourselves with:
  1. "0" → "1"
  2. "1" → "2"
  3. "2" → "3"
  4. "3" → "4"
  5. "4" → "5"
  6. "5" → "6"
  7. "6" → "7"
  8. "7" → "8"
  9. "8" → "9"
  10. "9" → "0"
  11. "2" → "0" (e.g. "23:59:59" to "00:00:00")
  12. "3" → "0" (e.g. "23:59:59" to "00:00:00")
  13. "5" → "0" (e.g. "23:59:59" to "00:00:00")
By manually fettling the angles of the arm joints, I came up with a fairly pleasing series of transitions that don't flail the arms too much, clattering them into adjacent digits.

The resulting web page uses CSS transitions to animate an inline SVG. Both the CSS rules and SVG elements are created programmatically in a fairly concise JavaScript script.


Monday 18 May 2020

United Nations Flags

There are 193 full members of the United Nations. I've put together a web page animating between those flags based on the SVG images in Wikimedia Commons.


The process was:

  1. Curate the 193 state flags plus the flag of the UN itself.
  2. Construct 194 approximations of those flags using JavaScript and HTML canvas elements.
  3. Reorder those approximations into a loop of 194 transitions based on similarity.
  4. Encode the 194 animations.
  5. Create a web page to allow the user to explore the animations.
Of course, there are 194! ÷ 2 animations needed to get between any two flags directly, but that's such an absurd number I didn't even think about trying. I did think about creating "shortcuts" between flags in the loop and minimising the number of animations between two arbitrary flags to, say, 10, but this is still a huge amount of work. I'll leave it as an exercise for the reader to work out the total number of animations required to reduce the maximum number of hops to "N".

In an ideal world, I'd animate the high-quality SVG images themselves; but that's also a task for another day ... if ever.

Sunday 9 February 2020

egg Try Operator

I had a bit of an epiphany earlier in the year. Not one of "Archimedean" proportions (no bathwater was at risk), but I seem to remember audibly muttering "Ooh" at the time. It concerned egg guarded assignment statements.

In many programs, it's common to attempt a data access that you know will fail quite often. For example, looking up an entry in a cache. Programming languages and their libraries provide a number of patterns for doing this:

  // C++
  auto found = cache.find(key);
  if (found != cache.end()) {
    CacheHit(*found);
  } else {
    CacheMiss();
 }

or

  // C#
  if (cache.TryGet(key, out var found)) {
    CacheHit(found);
  } else {
    CacheMiss();
  }

or

  // JavaScript
  if (key in cache) {
    CacheHit(cache[key]);
  } else {
    CacheMiss();
  }

or

  // Java
  if (cache.containsKey(key)) {
    CacheHit(cache.get(key));
  } else {
    CacheMiss();
  }

The C# paradigm (a.k.a. "try-get") is arguably the most elegant (it's atomic and concise), but it leads to a bloated interface for the containers:

  // C#
  interface ICache<K,V> : ...
  {
    bool TryGet(K key, out V value);
    V Get(K key);
    ...
  }

In, egg these interface members look like the following functions:

  // egg
  type<K,V> bool TryGet(K key, V* value);
  type<K,V> V Get(K key);

Both these members perform essentially the same task, but the "TryGet()" method will return "false" if the key doesn't exist, whereas "Get()" will throw an exception.

If we don't supply the "TryGet()" member, we need to use the following paradigm to emulate it:

  // egg
  bool found;
  K value;
  try {
    value = cache.Get(key);
    found = true;
  } catch (Exception e) {
    found = false;
  }

Whereas if we only supply "TryGet()", we need to use the following to emulate "Get()":

  // egg
  K value;
  if (!cache.TryGet(key, &value)) {
    throw ExceptionKeyNotFound();
  }

Like C# 7 pattern matching, the egg language has the concept of guarded assignments:

  if (type variable = expression) {
    assignment-successful
  } else {
    assignment-failed
  }

But this doesn't handle the case where the evaluation of "expression" raises an exception: you'd still need to wrap the whole thing up in a "try-catch" block.

My original idea was to allow functions to return a "void" value which can never be successfully assigned to a variable. You could then use the following pattern:

  // egg
  type<K,V> V|void Fetch(K key);
  ...
  if (var value = cache.Fetch(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

This is good for guarded assignments, but there's no informative exception payload for standard assignments or other evaluations. The following:

  // egg
  var value = cache.Fetch(key);

will just report an unhelpful "unable to assign 'void' value" exception at run-time.

My almost-"Eureka!" moment came when I considered adding a "try" operator to the language:

  type variable = try expression

The new operator is prefix unary and attempts to evaluate its operand at run-time. If it succeeds, the result is the value of the expression. If an exception is raised during the evaluation of the expression, the result of the operator is "void".

In general, this operator only makes sense inside a guarded assignment:

  // egg
  type<K,V> V Get(K key);
  ...
  if (var value = try cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

or, possibly, in an "if-try" condition:

  // egg
  if (try PerformSomething()) {
    Success();
  } else {
    Failure();
  }

In the latter case, the "try" operator takes an operand that evaluates to "void" and returns "false" or "true" depending on whether an exception is thrown. There's not a great benefit of this form beyond the usual standard "try-catch" syntax, and it hides the cause of the exception, so I'm less inclined to support it.

I haven't decided on the precise syntax for "guarded try assignments" (or indeed a better name!) but here are a few options:

Option 1 (currently preferred)

Place the "try" keyword immediately after the assignment operator:

  if (var value = try cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 2

Place the "try" keyword immediately before the variable type:

  if (try var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 3

Place the "try" keyword immediately after the "if" keyword:

  if try (var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 4

Replace the "if" keyword with "try":

  try (var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    CacheMiss();
  }

Option 5

Replace the "if" keyword with "try" and add an optional "catch" clause:

  try (var value = cache.Get(key)) {
    CacheHit(value);
  } else {
    AssignmentFailed();
  } catch (Exception e) {
    ExceptionRaised();
  }

Option 5 shows promise because you have access to the precise exception that was raised in the evaluation of the expression, but it's not immediately obvious which failure clause gets executed under all circumstances.

As I said, it wasn't a full-blown "Eureka!" moment, but it looks promising. It is a deviation from the standard curly-brace syntax; but then, so is guarded assignment.

Sunday 19 January 2020

Escapeless, Restartable, Binary Encoding

Imagine you are broadcasting a message over the air by Morse code. The message (or plain text) is made up of words of capital letters separated by spaces:

THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG

But, for some unexplained reason, you cannot transmit spaces – only the twenty-six letters. How would you come up with a protocol for transmitting messages such that a receiver can unambiguously reconstruct the plain text?

The obvious first stab is to transmit the message without spaces and hope the receiver can guess where the spaces should be:

THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG

This is fine for relatively intelligent humans, but fraught with problems for machines. And it's far from "unambiguous":

ADAPTORBIT
BANALLIES
TEACHART
UNCLEARRAY
UNDONEGATE

Another solution is to designate one letter as an escape character and append a discriminator after each occurrence. For instance, we could use "Z" as the escape character and use "ZS" to signify a space and "ZZ" to signify a single "Z" in the original message:

THEZSQUICKZSBROWNZSFOXZSJUMPSZSOVERZSTHEZSLAZZYZSDOG

Escaping is common in programming language source files with a finite character set: for example, "\" is utilised in string constants in many "curly brace" languages. Generally, one picks an "unlikely" character as the escape.

There are two problems with escaping in this manner:
  1. The length of the encoded message depends on the number of occurrences of the escape character in the plain text. A pathological case is where the plain text consists solely of repeats of the escape character (e.g. "ZZZZZZZZ"); this results in an encoding that is twice the size of the original.
  2. If part of the message is lost or corrupted, it is very difficult to reconstruct (or re-synchronise) the plain text particular when an escape sequence is "cracked" (e.g. "...SBROWNZ...")
We can address Point 2 by using a more sophisticated encoding scheme that is "restartable". This means that if part of the message is corrupt or missing we can, at least, re-synchronise with the data stream at a later point. An example of a restartable stream is UTF-8: you can always distinguish the start of a new code point from a continuation by looking at the top two bits of the byte. However, this doesn't solve the problem of variable encoding size and pathological bloat.

In encoding theory, this problem is characterised as encoding data in one alphabet into another smaller alphabet. In the case above, we're trying to encode a message in a 27-code alphabet into a 26-code one.

In real life, it's quite common to want to transmit arbitrary 8-bit binary data over a one-way channel as a series of distinct messages. If the communications channel isn't perfect, we probably want to be able to re-synchronise to the start of the next message after a failure. If the channel is indeed one-way, we cannot ask the sender to re-transmit; so it is up to the receiver to recover as best they can.

If we use a unique sentinel between the messages, this process is analogous to trying to encode a 257-code alphabet (256 byte values + 1 sentinel) using a 256-code alphabet (a byte stream).

This looks like a tall order and if you naively try to convert an arbitrarily-long, base-257 number to base-256 you quickly run out of steam. However, a nice trick outlined by Ivan Kosarev leads to a breakthrough.

Supposing we separate the binary messages in their encoded form using a single zero byte: 0x00. We must be careful not to use 0x00 within the encoded message otherwise we wouldn't be able to re-synchronise to the start of the next message after a failure. We could choose a non-zero byte value as a substitute for 0x00 in the original message and replace all occurrences of zero with the zero-substitute value. However, how do we choose the zero-substitute value in such a way that it is guaranteed not to occur in the original message?

Consider dividing the original messages up into 254-byte chunks. Within each chunk, there is guaranteed to be at least one non-zero byte value that does not occur. (We could not use 255-byte chunks because of the edge case where all byte values 0x01 to 0xFF occur leaving no non-zero value for the zero-substitute) For each chunk, pick the highest byte value that does not occur as the zero-substitute for that chunk.

Our encoding algorithm then becomes:

for each message:
  split message into chunks of 254 bytes (last may be smaller)
  for each chunk:
    substitute := greatest byte value not present within chunk
    output substitute
    for each byte in chunk:
      if byte == 0x00:
        output substitute
      else:
        output byte
  output 0x00

This isn't quite a true streaming algorithm because you've got to collect 254-byte chunks to find the appropriate zero-substitute. But it's close.

Decoding requires less bookkeeping:

message := empty byte buffer
count := 0
for each byte in input stream:
  if byte == 0x00:
    output message
    reset message to empty byte buffer
    count := 0
  else if count == 0:
    substitute := byte
    count := 254
  else
    count := count - 1
    if byte == sentinel:
      append 0x00 to message
    else:
      append byte to message

Both the encoder and decoder can be implemented on machines with very limited resources. Neither needs to know anything about the format of the messages they are processing.

If a failure is detected, you can quickly fast-forward the data stream to the beginning of the next message by searching for a 0x00 byte. Note that we assume that some robust checksum is encoded into the messages to detect corruption with high probability.

Crucially, the relationship between the size of the plain messages and their encoded forms does not rely on the content of the messages:

encoded_size == floor((plain_size * 255 + 253) / 254) + 1

The "encoded_size" includes the zero byte separator between messages. For large messages, the data overhead (excluding the separator) tends towards less than 0.4%. For small messages (less than the chunk size), the overhead is never more than one byte.

An interesting variant is to XOR the current substitute value with all bytes so as to obfuscate the payload:

for each message:
  split message into chunks of 254 bytes (last may be smaller)
  for each chunk:
    substitute := greatest byte value not present within chunk
    output substitute
    for each byte in chunk:
      output (byte xor substitute)
  output 0x00

Decoding with obfuscation:

message := empty byte buffer
count := 0
for each byte in input stream:
  if byte == 0x00:
    output message
    reset message to empty byte buffer
    count := 0
  else if count == 0:
    substitute := byte
    count := 254
  else
    count := count - 1
    append (byte xor substitute) to message