Wednesday, 30 May 2018

egg Chronology of Computer Languages 4

I couldn't resist re-visiting this after the Bank Holiday weekend.

After a bit more research, the type of diagram I'm actually looking for for COCL is a layered graph drawing, also known as a Sugiyama diagram. As suspected, it's a difficult problem to lay out these diagrams but there are some well-known heuristics. Placing a descendant at an x-position that is an average (mean, median, etc.) of its ancestors' x-positions helps reduce edge crossings. It also seems to organise the programming languages of COCL into related groups.

So I did a bit more fettling of the box-drawing version of the data-set and then produced a colourful SVG version based on that layout data. The only slight wrinkle was getting the arcs looking good when there isn't enough room for a full ninety-degree elbow. See the purple link between "Lisp" and "Logo" below:

I didn't animate the final SVG because it's just too expensive computationally and there's no scope for caching frames like I did with the canvas-based "waterfall" version.


The opacity of each connection is a function of the time in years between the appearance of the ancestor and the descendant. I'm not sure it's strictly meaningful, but it adds variety.

Saturday, 26 May 2018

egg Chronology of Computer Languages 3

As one last task, I looked at producing a more traditional "flow chart" from the COCL data-set. Yet again, it turned out to be surprisingly difficult.

Sankey diagram is a close match for what I'm looking for: an ordered DAG of ancestors and descendants. But, in all the little tools I briefly looked at, the layouts of the final charts leave a lot to be desired.

I started work on a basic network layout engine in C++ but quickly came to a dead end: the number of permutations for the node positioning alone is in the order of 1e+96 which is just not tractable. And this doesn't even take into consideration the edge routing.

I finally plumped for two tools (one in C++ and the other in JavaScript) to semi-automate the layout and then continually fettled it by hand (for instance, lining up the C and Pascal families of languages vertically) until it was relatively compact.

The final result (bottom) displays the diagram using fixed-width, box-drawing characters:



Wednesday, 23 May 2018

egg Chronology of Computer Languages 2

Another visualisation I tried with the COCL data-set was a chord diagram (revamped from the cocktail version I did a couple of years ago). Apart for realising how bad my JavaScript was back then, I discovered that chord diagrams do not make the "direction" of relationships obvious. I've added a bit of animation to try to remedy that when you hover over a language name.
I find this visualisation much more aesthetically pleasing that the rather workman-like "waterfall". The fact that the chronology is clock-like is also rather agreeable.

I turns out that drawing "marching ant" lines is quite expensive, even in modern browsers; so I re-vamped the "waterfall" diagram from yesterday to cache frames. It means that the animation is a bit jittery for the first few seconds after landing on the web page, but quickly becomes very smooth.

Click here for the interactive versions.

Tuesday, 22 May 2018

egg Chronology of Computer Languages 1

I was thinking about "curly brace languages" in the context of egg at the weekend, and remembered a graphic I'd seen of the "family tree" of computer programming languages. I googled around and found the images I'd seen at the turn of the millennium (e.g. here and here), but nothing from the last ten years. Strange, I thought; I'll have a go at either updating or redrawing a tree.

It quickly became apparent why I couldn't find such a chart.

I started by manually scraping the data from the Wikipedia pages and producing a spreadsheet of dependencies with:

  • Language name
  • Year it first appeared
  • Ancestors (prior languages that are said to have influenced it)
  • Link to Wikipedia page
That dataset (COCL: A Chronology of Computer Languages) is stored as JavaScript here. There are over sixty data points ranging from FORTRAN (1957) to Swift (2014). The choice of points is fairly subjective, the data is direct from Wikipedia.

Then I tried to draw the recombining tree (DAG) by hand. I got this far and gave up:


Even just concentrating on a subset (the "curlies") didn't help much:


The problem is that I'm trying to get the ordering of the chronology fixed in one of the axes (the y-axis in this case: oldest at the top, newest at the bottom). I've tried curved lines and even duplicated nodes to try to minimise the number of crossing edges. I've tried putting the oldest languages in the centre of the canvas and working outwards, like a bulls-eye. But nothing seems to work. I'm sure there are technique and/or software out there that can solve this problem, but the linear ordering seems to be beyond the ones I briefly looked at.

So I decided to throw a bit of JavaScript at it. It turns out it's easy to get ugly results from the dataset; the trick is to make it pretty.

My first attempt is an animated waterfall chart:


It's not very elegant, but does give some useful insights into popular ancestors. Everyone seems to want Lisp in their language's DNA these days.

Click here for the interactive version.

Tuesday, 1 May 2018

egg Strings

Currently, strings are a fundamental type in egg. In many languages they are not. I'm still undecided how much compiler and language support to provide. In particular, because egg is supposed to be a stable language, if strings are a fundamental type, any decisions I make now will need to be "baked" into the specification. So I've been looking into what strings actually are.

In languages where strings are not fundamental, such as C++, there may be many representations of text:

  char*
  char[N]
  std::string
  std::wstring
  std::u16string
  std::u32string

This has the advantage that the different representations and implementations can be tailored for different criteria, but it can be a bit overwhelming for learners.

Strings in egg are simply sequences of Unicode characters. They are immutable (like Java) and do not directly expose their internal representation. They are also independent of your locale.

So what can you do with strings like these? The Wikipedia pages leave a bit to be desired on this subject, so I had to start from first principles.

Assignment

Strings can be assigned from string literals or from other strings:

  string greeting = "hello";
  string word = greeting;

Concatenation

Strings can be created by concatenating other strings using the type-name constructor:

  string greeting = string("hello"); // "hello"
  greeting = string(greeting," world"); // "hello world"

Property: int length

Strings have a single property, "length", which is the count of the Unicode code points (not code units) that make up the string:

  string greeting = "hello";
  print(greeting.length); // prints '5'

Operator: string[int]

Strings have an indexing operator to extract single-character substrings at the relevant integer code point index:

  string greeting = "hello";
  string letter = greeting[0]; // sets letter to "h"

If the index is negative, or beyond the end of the string, an exception is thrown.

Operators: == and !=

Strings have equality/inequality operators:

  print("hello" == "hello"); // prints 'true'
  print("hello" == "goodbye"); // prints 'false'
  print("hello" != "hello"); // prints 'false'
  print("hello" != "goodbye"); // prints 'true'

Equality is case-sensitive and based on the equality of the constituent code point sequence. Strings are never equal to non-string values.

Iteration

The code points that make up a string can be iterated:

  string greeting = "hello";
  for (string i : greeting) print(i); // prints 'h' 'e' 'l' 'l' 'o'


Note that the iteration returns single-character strings.

Member: int hash()

Computes a 64-bit signed hash integer for use in containers and algorithms.

Member: int compare(string other)

Compares the string to another in an invariant manner (code point by code point). Returns negative/zero/position depending on whether the string is less/equal.greater than the other.

  "hello".compare("goodbye") // returns a positive integer

Case-insensitive and locale-specific comparisons are outside the remit of the basic string methods.

To prevent accident assumptions about collation ordering, the relational operators "<", "<=", ">=" and ">" are not directly supported for strings.

Member: bool contains(string needle)

Searches for a substring within the string.

  "beggar".contains("egg") // returns true

This is an optimisation of

  haystack.indexOf(needle) != null

Member: bool startsWith(string needle)

Determines if a string starts with substring.

  "beggar".startsWith("beg") // returns true

This is an optimisation of

  haystack.indexOf(needle) == 0

Member: bool endsWith(string needle)

Determines if a string starts with substring.

  "beggar".endsWith("beg") // returns false

This is an optimisation of

  haystack.indexOf(needle) == (haystack.length - needle.length)

Member: int? indexOf(string needle)

Returns the zero-based index of the first occurrence of the substring, or "null" if the substring is not present.

  "beggar".indexOf("egg") // returns 1

The full signature is

  int? indexOf(string needle,
               int? start_index = null,
               int? limit = null,
               bool? negate = null)

Member: int? lastIndexOf(string needle)

Returns the zero-based index of the last occurrence of the substring or "null" if the substring is not present.

  "beggar".lastIndexOf("g") // returns 3

The full signature is

  int? lastIndexOf(string needle,
                   int? start_index = null,
                   int? limit = null,
                   bool? negate = null)

Member: string replace(string needle, string replacement)

Replaces occurrences of a substring with another string.

  "beggar".replace("egg", "e") // returns "bear"

Remember, the result is returned from this function; the original string is unaltered. The full signature is

  string replace(string needle,
                 string replacement,
                 int? max_occurrences = null)

If "max_occurrences" is negative, replacements are made for the last occurrences of the substring.

Member: string slice(int? begin = null, int? end = null)

Extracts a substring from the string. The parameters are both optional indices. Negative indices count back from the end of the string:

  "beggar".slice(1) // returns "eggar"
  "beggar".slice(1, 4) // returns "egg"
  "beggar".slice(1, -2) // returns "egg"

Member: string[] split(string separator, int? limit = null)

Splits a string into an array of substrings:

  "beggar".split("g") // returns ["be","","ar"]

If "limit" is negative, splits are limited to the last occurrences of the separator.

Member: string join(any[] ...params)

Joins strings together:

  "a".join("b", "n", "n", "") // returns "banana"

Member: string repeat(int count)

Creates a new string by concatenating repetitions of a source string:

  "beggar".repeat(0) // returns ""


  "beggar".repeat(3) // returns "beggarbeggarbeggar"

And that's pretty much it. There are some obvious (and deliberate) omissions:

MISSING Member: string format(any[] ...params)

The functionality for formatting general text will live in separate modules. This will allow for more than one formatting scheme and also allow the algorithms to evolve over time.

MISSING Member: int codePointAt(int index)

This is really a function of the (Unicode) encoding specification, so should be defined there:

  import unicode = /*...*/;
  print(unicode.getCodePointAt("hello", 1)); // prints 101

MISSING Members: string lowercase() and string uppercase()

These are omitted because they rely on the Unicode character database, which changes over time. See here. The functionality will probably be made available via a versioned module:

  import unicode = /*...*/;
  print(unicode.toUpper("hello")); // prints "HELLO"

MISSING Members: string trim(), string trimLeft() and string trimRight()

Removing white-space (or any class of character) from the ends of strings is a common operation. The Unicode standards committee designated twenty-odd code points as "white-space". This includes one of my favourite characters: U+1680: OGHAM SPACE MARK. Alas, the categorisation is somewhat fluid, so one requires access to the Unicode character database to get it right.

MISSING Member: string reverse()

Although characters in egg strings are Unicode, reversing a Unicode string may not produce the results you expect. To do it properly again requires access to the Unicode character database, so the reverse function should belong to a versioned module.

Also, there some members that I'm still dithering about:

POSSIBLE Members: string padLeft() and string padRight()

These members can be synthesised using the existing members. However, they are surprisingly easy to get wrong, so maybe they're a useful addition:

  string padLeft(int min_length, string? padding = null)
  string padRight(int min_length, string? padding = null)

Computers Don't Lie T-Shirt

Computer Don't Lie - They're Just Misinformed