Tuesday 24 July 2018

egg Sequences 1

Sequences of items are a recurring theme in programming, but different computer languages have differing levels of support for them. For the purposes of these posts, I'm going to define a sequence as:
  1. Zero or more homogeneous elements
  2. Elements may be repeated
  3. Elements have a position (or positions, in the case of repeated elements) within the sequence
  4. The only operation allowed for a consumer is to fetch the next element; this may fail if we have reached the end of the sequence
  5. Sequences may be endless
This definition is known by many names in different languages (lists, enumerations, streams, iterations, etc.) but I'll stick with "sequence" to avoid ambiguity.

Let's break down the definition. Point 1 sounds quite restrictive: what if we want a sequence of integers and strings? This isn't really a problem in a language with composable types; we just define the type of the elements in our sequence to be the union of all the expected, individual types. In egg, we could use the hold-all "any" (or "any?" if we want nulls too).

Point 4 hints that, from a programming point of view, the consumer of a sequence doesn't see very much. But what does it need to see?

Support in Existing Languages

Rust has a very good interface for what it calls iterators. It has a single member function that returns the next element or nothing:

    // Rust
    fn next(&mut self) -> Option;

JavaScript has a similar protocol. The single member returns an object with two fields ("value", which may be undefined, and "done", which is a Boolean):

    // JavaScript
    function next();

Other languages are, in my opinion, a bit inferior in this respect. C# defines an enumerator interface with three members:

    // C#
    public T Current { get; }
    public bool MoveNext();
    public void Reset();

My two concerns with C#'s "IEnumerator<T>" are:
  • Getting the next element is not an atomic operation: you need to move to the next element and then read it.
  • The "Reset" method suggests that sequences are always "replayable". This is rarely the case (see the comment in the source).
In Java 8, sequences are named "iterators". The interface has four members:

    // Java
    default void forEachRemaining(Consumer action)
    boolean hasNext();
    T next();
    default void remove();

The first is a helper that can be overridden to optimise sequence operations. The next two exhibit the same atomicity problem that C#'s interface has. The final "remove" method is just plain strange. Needless to say, the default implementation throws a "not supported" exception.

C++ iterators are heavily templated and do not, in general, work via interfaces (though see Boost for some work towards this). Iterators in C++ are often bi-directional.

Atomicity

As I mentioned above, C# and Java have potential atomicity problems. In the vast majority of cases, these never appear or can be avoided by using additional locks. However, sequences can be a useful building block in concurrent systems (e.g. CSP), so native atomicity is a desirable feature.

For example, imagine a sequence of tasks being used as a queue of work for two or more consumers. If the consumers cannot atomically fetch the next work item from the sequence, there is the potential for work items to be accidentally skipped or even performed twice.

For that reason, if written in C++17, one interface for sequences could be:

    // C++
    template<typename T> class Sequence {
      virtual ~Sequence() {}
      virtual std::optional<T> next() = 0;
    };

This is effectively the Rust interface above.

Sequence Functions in egg

In egg, we can simplify this interface (for integers) to a single function signature:

    // egg
    void|int next()

We could have used the following:

    // egg
    int? next()

But that would mean that sequences could not contain null values.

The type of a sequence function for integers is therefore "(void|int)()" which is somewhat ugly. I originally thought that abbreviating this to "int..." would be quite intuitive, but quickly ran into syntax problems (see later). At present, I am toying with the idea of using "int!" as a shorthand, based on some fuzzy notion that other languages using "!" when dealing with channels.

Thus, if a function took two integer sequences and merged them in some way, the signature could be:

    // egg
    int! merge(int! left, int! right)

This implies that sequences are first-class entities: what R. P. James, and A. Sabry call "externalised yields" that can be used outside of the "for-each" control statement.

However, this only deals with consumers of sequences. How does one produce a sequence in the first place?

Sequence Generators

One way to create a sequence is via a container that permits iteration. Most languages that support the "for-each" control statement allow you to iterate around elements in an array, for example.

    // JavaScript
    var a = [10, 20, 30];
    for (var v of a) {
      process(v);
    }

However, one of the strengths of sequences is that the elements can be generated on-demand or lazily. Indeed, some sequences are infinite, so populating a container with all the elements of a sequence may be impractical or impossible.

Both C# and JavaScript have the "yield" keyword. C, C++ and Java have non-standard mechanisms for achieving similar effects, but they are not part of the language.

Here's a generator function to produce the infinite, zero-based, Fibonacci sequence in JavaScript:

    function* fibonacci() {
      var a = 0;
      var b = 1;
      for (;;) {  
        yield a;
        var c = a + b;
        a = b;
        b = c;
      }
    }

And in C#:

    public IEnumerable<int> fibonacci() {
      int a = 0;
      int b = 1;
      for (;;) {
        yield return a;
        int c = a + b;
        a = b;
        b = c;
      }
    }

The syntax is slightly different, but the fundamentals appear similar. Surely that means that generator functions are trivial to implement? Alas, no.

Next time, I'll discuss why generators are so complicated under the surface. In the mean-time, might I again suggest this paper as some bed-time reading?

No comments:

Post a Comment