Tuesday 19 June 2018

egg Pointers and Pointees

This week, I've been struggling with pointers in the egg language specification. First off, a bit of nomenclature: I'm going to use the term "pointer" to mean a reference to data in another location in general; I find the overloaded term "reference" too confusing.

There are any number of pitfalls with pointers, particularly C/C++-style raw pointers. As egg is meant to make programming easier for learners, it's useful to look at some of these issues.

Pointer Arithmetic

C/C++ allows arithmetic on raw pointers:

  // C/C++ code
  char buffer[10];
  char* p = buffer;
  p += 20;
  *p = '!' // Undefined behaviour

This is prone to error, so egg does not allow arbitrary pointer arithmetic. The only thing you can do with them is:
  1. Create pointers by taking the address of a pointee using "&";
  2. Assign one pointer to another;
  3. Pass pointers to functions; and
  4. De-reference a pointer to get hold of the pointee's value
In particular, you cannot increment "++" or decrement "--" a pointer.

Null Pointers

As mentioned in my last blog post, null pointers have been called a "billion-dollar mistake". Consequently, egg pointers cannot contain null.

  // egg code
  int* p = null; // Not allowed

However, there's nothing wrong with declaring the variable as taking a pointer or a null value:

  // egg code
  int*? p = null; // Allowed

Note the subtle difference between "int*?" and "int?*": the former can hold the value null or a pointer to an integer, whilst the latter can hold a (non-null) pointer to a value that is either null or an integer.

Wild Pointers

Wild pointers are raw pointers whose value has not been initialised:

  // C/C++ code
  char* p;
  *p = '!' // Undefined behaviour

The egg language can detect uninitialised pointers, so this shouldn't be a problem.

Dangling Pointers

Dangling pointers are raw pointers that used to point to valid pointees, but no longer do:

  // C++ code
  char* p = new char;
  delete p;
  *p = '!' // Undefined behaviour

The egg language uses garbage collection, so the "delete" (and/or "free") mechanism is redundant (and explicitly disallowed). However, there are other causes of dangling pointers:

  // C/C++ code
  char* p;
  {
    char buffer;
    p = &buffer;
  }
  *p = '!' // Undefined behaviour

The problem here is that "buffer" is allocated on the stack, but that stack memory is "reclaimed" at the end of the block.

The original Pascal solved this problem but not having an "address of" operator. That is, you could not take the address of a variable; you could only obtain a new pointer by allocating data on the heap using "new()". Later flavours of Pascal added the "@" prefix operator for taking the address of arbitrary variables.

In general, stack-allocated values have the potential to cause dangling pointers. See below.

Stack versus Heap

As this Rust document illustrates, the concepts of "stack" and "heap" are important for systems programming and computer science, but they can be confusing for learners. Some languages (like old-style C/C++) force you to be explicit but give very little protection against problems like dangling pointers. Newer languages (like Go and Rust) provide better security, but at the cost of a much steeper (and longer) learning curve.

One solution to the problem is to only allow either stack or heap allocations. Early FORTH implementations only allowed stack allocations: users found this severely limiting. Many scripting languages only allow heap allocations, but for general purpose languages, there are efficiency concerns with this strategy. One compromise, used by many languages, is to introduce the concept of boxing.

Boxing and Unboxing

Boxing is the act of  wrapping fundamental types so that they become "reference" types and so are allocated on the heap. In Rust 1.0, this is explicit:

  // Rust 1.0 code
  let p = Box::new(5); // p points to a heap-allocated integer 5

However, some languages "autobox" fundamental types, which can lead to surprising behaviour. In languages where complex types (sometimes known as "reference" types) are always allocated on the heap, boxing/unboxing is just an annoying edge case.

By-Reference Function Parameters

Most imperative languages have the ability to specify whether individual function parameters are passed by value or by reference. Pointers are one mechanism to achieve this:

  // C code
  void by_value(int x);
  void by_reference(int* x);

  int v = 10;
  by_value(v); // v is not changed
  by_reference(&v); // v may be changed

If using raw pointers, care must be taken concerning null pointers.

In many languages, complex types are passed by what Barbara Liskov et al named call by sharing:

  // JavaScript code
  function by_sharing(x) { ... }

  var v = { a: 1, b: 2 };
  by_sharing(v); // object pointed to by v may change

Therefore, true call-by-reference is often only required for fundamental types.

Pointers in egg

Conceptually, variables (including function parameters) in egg are stored as triples. For example:

  // egg code
  int? v = 10;

will be stored as:
  • Variable name: "v"
  • Variable type: "int|null"
  • Current value: 10
A variable's current value also has a type (in this case "int"); this is known as the run-time type of "v".

Variable triples are generally stored on a stack (one exception is a variable inside a co-routine, but more on that much later) and their values are variants (tagged unions). As variables are introduced, they are added to the top of the stack; as they go out of scope, they are dropped. The variable value variants can store a fairly large number of things (each fits into 64 bits, excluding the tag):
  1. A null value,
  2. Boolean values,
  3. Integer values,
  4. Floating-point values,
  5. References to strings (allocated on the heap),
  6. References to objects (allocated on the heap),
  7. Pointers,
  8. Indirects,
  9. etc.
In these variants, a pointer (Item 7) is a soft (i.e. garbage-collected) pointer to another variant allocated on the heap. An arbitrary level of indirection can be achieved by just chaining together pointers to other pointers.

Consider the following code:

  // egg code
  int v = 10;
  int* p = &v;

The first declaration pushes the following triple to the top of the variable stack:
  • Name="v", Type="int", Value=10
The second declaration has a problem. It tries to push a new triple:
  • Name="p", Type="int*", Value=???
  • Name="v", Type="int", Value=10
But what is the pointer value of "p"? Unfortunately, the value of "v" is not on the heap, as required; if we take its address, we're leaving ourselves open to the dangling pointer problem. Our solution is to patch the value of "v" to be an "Indirect" value (Item 8 above):
  • Name="p", Type="int*", Value=[heap314]
  • Name="v", Type="int", Value=Indirect → 10 at address [heap314]
We do this by copying the previous value of "v" to a new variant (at address [heap314] say) and then modifying the triple for "v" to point to this indirection. Now, the lifetime of the value of "v" is independent of the lifetime of the variable "v". This mitigates the dangling pointer problem, but we must be careful to silently follow "Indirect" references appropriately.

The current implementation patches values in this way if and when the first "address of " operator "&" is invoked on a variable. This means that heap allocations are not required for the vast majority of scalar values created during the execution of an egg program. In practice, it may be safer (in a thread-rich environment) to perform static analysis of the egg code to determine if a variable can ever have its address taken and, if so, create it with the indirection built in. This would eliminate potential race conditions in patching and accessing a variable across threads.

Limitations

This strategy works well with simple variables. There are at least two limitations that I can currently see.

First, this scheme does not extend to co-routine variables. The rules for the lifetimes of co-routine variables are already complicated: adding the ability to extend their lives by taking their address may be a step too far. I suspect storing all co-routine variables on the heap may be a good compromise.

Second, you cannot currently take the address of expressions other than variables.

  // Case 1: constants
  int* p1 = &10;

  // Case 2: arrays
  var a = [ 1, 2, 3 ];
  int* p2 = &(a[1]);

  // Case 3: objects
  var x = { y: 1 };
  int* p3 = &(x.y);

In Case 1, you might reasonably expect "p1" to be pointing at an integer 10 allocated on the heap. You can currently get around this by using a temporary variable, but it may be an edge case I'll revisit.

In Case 2, taking the address of an array element is not permitted. One argument against this ability is to consider what to do if the array is resized to fewer elements. This has re-introduced a dangling pointer.

Case 3 can be argued against in a similar fashion.

No comments:

Post a Comment