Tuesday 26 June 2018

egg Garbage Collection 1

I've always struggled with tracing garbage collection. Until last week, I assumed it's a bit of a black art, the internals of which are only understood by a small number of acolytes. In truth, you can go a long way without needing anything more than reference-counting. Indeed, Apple ditched their tracing garbage collector in 2015 in favour of an "automatic reference-counting (ARC)" solution. But memory management is a thorny issue for learners of any language or infrastructure, so I knew that tracing garbage collection was fairly inevitable for the egg run-time.

Background

Garbage collection is any mechanism whereby software knows when a resource is no longer accessible (is "dead", not "alive") and can therefore safely be freed up. For heap-based memory objects, "dead" instances are those that can no longer be reached via pointers.

There are three main forms of garbage collection:

Automatic Reference Counting

Each "live" instance contains a "reference count" of the number of pointers pointing to it. When a new reference is made to the instance, the count is incremented. When a reference is removed, the count is decremented. When the count is decremented to zero, there are, by definition, no pointers pointing at this instance: it can be declared "dead".

One potential problem with ARC is managing cycles. If instance "a" is pointing to instance "b" and instance "b" is pointing back to "a" (either directly or indirectly), the cycle keeps both instances alive (with reference counts of at least one) even if there are no other pointers pointing at either of them. Sometimes cycles can be broken using "weak references" but this is tricky and I won't explore that technique here.

Mark and Sweep

"Mark and sweep" and its variants are tracing garbage collectors that periodically trace which instances are unreachable and therefore suitable for "collection". In its simplest form, the garbage collector "stops the world" (i.e. prevents new instances being created and links between references from changing) and then performs the following:
  1. Create a list of all instances: "everything"
  2. Create an empty set: "marked"
  3. From list "everything", create another list of instances that are accessible from the outside world: "roots"
  4. For each "node" in "roots"
    • (MARK) If "node" is not already in "marked" then
      • Add "node" to "marked"
      • Repeat (MARK) recursively for each pointer in "node" to other nodes
  5. For each "dead" in "everything" that is not in "marked"
    • (SWEEP) Reclaim "dead"
This technique gets around cycles of instances, but it can be quite time-consuming (unpredictably so) and can use extra memory for housekeeping.

C++ Implementations

I've always thought that garbage collectors had to plumb themselves into C++ at quite a low-level; for instance, by redefining "operator new" or using special macros like the excellent Boehm-Demers-Weiser collector. However, I wanted the garbage collector for egg to be easy to switch out for an alternative in the future, so I looked for a less invasive/pervasive solution.

It turns out that garbage collection is just like any other piece of software: you can design it to be as decoupled or coupled as you like (with certain trade-offs), so I set about designing an infrastructure that tries to balance some desirable attributes:
  • Fast to execute
  • Easy to maintain
  • Simple to learn and use
  • Foolproof
  • Quick to implement

Baskets

The first solution I came up with is based around "baskets".
  • A "basket" is made up of zero or more "collectables"
  • Each "collectable" can belong to at most one "basket"
    • A "collectable" that does not belong to a "basket" is deemed an orphan
  • Each "collectable" may optionally be a root of its "basket"
    • All roots are deemed to be always reachable
  • Each "collectable" has zero or more "links" to other "collectables" in its "basket"
    • A "link" can be created, severed or amended
Actions that the programmer needs to perform include:
  • Creating a basket
  • Creating a collectable
  • Adding a collectable to a basket
  • Marking a collectable in a basket as being a root/non-root
  • Creating a link between two collectables in the same basket
  • Modifying a link to point are a difference collectable in the same basket
  • Severing a link
  • Collecting garbage
Note that there is no explicit way to remove a collectable from a basket. However, it may be useful for performance and testing to purge all the collectables from a basket as a single action.

The naive garbage collection algorithm is then:
  1. Stop the world (at least for this basket)
  2. Create a set of all collectables in the basket: "unvisited"
  3. For each "node" in this basket's list of roots
    • (MARK) If "node" is in "unvisited" then
      • Remove "node" from "unvisited"
      • Repeat (MARK) recursively for each link in "node" to other nodes
  4. For each "dead" in "unvisited"
    • (SWEEP) Reclaim "dead"
  5. Start the world
Notice that keeping track of the unvisited nodes, instead of the visited ones, simplifies the logic slightly.

In this scheme, baskets don't actually perform memory management operations on collectables. Collectables are allocated externally and then added to the basket. Similarly, the function to collect garbage returns a list of collectables that have been evicted from the basket. (In reality, it uses the visitor pattern to expose this list, thereby removing the need to physically construct it).

The Root of All Evil?

Attentive readers may have noticed that the concept of "root" is poorly defined. Also, if we want to extend this scheme to concurrent garbage collection (and who wouldn't?), how do we make sure we can set up the web of links between newly-added collectables before the collector sweeps them from under our noses?

Shock! Horror! There's more...

No comments:

Post a Comment