association-list a veritable mint for dunning-krugerrands

Actors, hierarchy, and anarchist software

A few days ago, Steve Waldman of Interfluidity tweeted the following:

I responded thusly:

Which is true, but at the same time deserves a more detailed unpacking (also more commas).

What this sort of statement typically means in actor systems is that local systems of supervision are hierarchical. Akka's supervisors appear to derive directly from Erlang's supervisors. In Erlang, an important design principle is to let whatever is going to fail, fail, and then restart it cleanly. Supervisors are a generic take on this strategy: a supervisor starts, watches, and restarts (or reports on) its child actors when they fail. Supervisors can have their own supervisors, on up the chain to the root supervisor owned directly by the virtual machine.
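For the uninitiated, a minimal OTP supervisor looks something like the sketch below. This isn't code from any real system; the my_worker module and its start_link/0 are hypothetical stand-ins.

    -module(my_sup).
    -behaviour(supervisor).

    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% one_for_one: restart only the child that died, tolerating at
        %% most 5 restarts in 10 seconds before the supervisor gives up
        %% and fails itself, escalating to *its* supervisor.
        RestartStrategy = {one_for_one, 5, 10},
        ChildSpec = {my_worker,
                     {my_worker, start_link, []}, %% how to (re)start the child
                     permanent,                   %% always restart on exit
                     5000,                        %% ms allowed for clean shutdown
                     worker,
                     [my_worker]},
        {ok, {RestartStrategy, [ChildSpec]}}.

Exhausting the restart budget is itself treated as a failure, which is how trouble propagates up the tree.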

Note an important distinction from real-world hierarchies here. For the most part, Erlang (and, I presume, Akka) supervisors don't tell their children what to do. Their children know what to do; supervisors merely decide when they start, and clean up after them when they explode. However, in real software even this is sometimes too much, as a supervisor can become a serialization point (a concurrency bottleneck), and you have to decentralize the launch and cleanup of critical actor systems.

This leads me to a more important point: in the large, distributed systems tend to be decentralized, for reasons of both scalability and efficiency. Marshalling all of the data needed to operate a large system correctly into a single decision loop tends to make these systems inflexible, fragile, and slow. As such, distributed systems tend towards decentralized and distributed decision-making: gossip protocols (to globally distribute critical data), leader election (for when centralized decisions must be made), and, most importantly, system designs that minimize internal communication and shared state. Riak, the product that I work on, is a reasonable example here: homogeneous nodes with a temporarily elected leader, all globally important information distributed by gossip, and thousands of actors working together as a single system while sending each other as few messages as possible.
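To give a flavor of the gossip part, the toy below sketches the general technique, not Riak's actual gossip protocol. Each node periodically pushes its view of shared state to one random peer and merges whatever it receives; because the merge is idempotent and order-insensitive, every node converges on the same map with no central coordinator.

    -module(gossip_sketch).
    -export([start/2, loop/2]).

    %% Peers is a list of pids running this same loop.
    start(State0, Peers) ->
        spawn(?MODULE, loop, [State0, Peers]).

    %% State is a map of Key => {Version, Value}. Keeping the higher
    %% version per key makes merging idempotent and order-insensitive,
    %% so it doesn't matter who gossips to whom, or how often.
    merge(Mine, Theirs) ->
        maps:fold(fun(K, {V, _} = Entry, Acc) ->
                      case maps:find(K, Acc) of
                          {ok, {V0, _}} when V0 >= V -> Acc;
                          _ -> maps:put(K, Entry, Acc)
                      end
                  end, Mine, Theirs).

    loop(State, Peers) ->
        receive
            {gossip, Remote} -> loop(merge(State, Remote), Peers);
            {update, K, E}   -> loop(maps:put(K, E, State), Peers)
        after 1000 ->
            case Peers of
                [] -> loop(State, Peers);
                _  ->
                    %% push our whole view to one random peer
                    Peer = lists:nth(rand:uniform(length(Peers)), Peers),
                    Peer ! {gossip, State},
                    loop(State, Peers)
            end
        end.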

So why then supervisors, in the small? My suspicion is time and engineering manpower constraints. Designing these higher-level decentralized systems is extremely complex. At Basho, my employer, we spend quite a lot of time getting them right, or as right as we can, and we rely on powerful tools like QuickCheck to help us do it. But if you have to write a formal model for every tiny piece of the system, you're doomed, at least in today's fast-moving software industry. So you fall back on informally specified but easier-to-reason-about systems like supervision trees, even though they can bite you in the end. When I mentioned supervisors as a concurrency bottleneck above, I was referring to a real-world issue we encountered: our get and put FSM supervisors (in a nutshell, these finite state machines directly oversee the storage and retrieval of data from our disk backends) were becoming overwhelmed under high load, unable to start child actors quickly enough to fully utilize the machine the VM was running on.
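The shape of that bottleneck, and of the workaround, is roughly the following. This is a hedged sketch, not our actual code; get_fsm_sup and the request-handling body are hypothetical.

    -module(fsm_start_sketch).
    -export([start_supervised/1, start_direct/1, run/1]).

    %% Every launch is a synchronous call into the one supervisor
    %% process; under heavy load, all FSM starts queue behind a single
    %% mailbox. (get_fsm_sup is a hypothetical simple_one_for_one
    %% supervisor registered under that name.)
    start_supervised(Req) ->
        supervisor:start_child(get_fsm_sup, [Req]).

    %% Spawning directly from the caller removes the shared
    %% serialization point. The trade-off: the caller now owns
    %% monitoring, crash handling, and cleanup.
    start_direct(Req) ->
        proc_lib:spawn_link(?MODULE, run, [Req]).

    run(Req) ->
        %% stand-in for the real get/put FSM work
        io:format("handling ~p~n", [Req]).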

So, in the end, what does this say about the original query? Hierarchy falls naturally out of untutored designs because we find it easy to think about multi-actor systems this way (OOP perhaps over-strenuously encourages this sort of thinking too, but that's another post). But at scale, hierarchies tend to fall apart, because the parts of the system spend too much time talking to each other to get any actual work done. This is, of course, the logic of the market, or of an anarchist utopia (so many things become isomorphic when viewed from far enough away). I think the most important lesson to draw from all of this is that designing and understanding highly concurrent systems with many interacting parts (inside or outside of a computer) is a task that lies right at the edge of human cognitive abilities. We should never be entirely comfortable with such systems, because it's extremely difficult to tell whether they're correct; we should always be re-evaluating them, and working on the tooling that lets us re-evaluate them, find their defects, and correct them.

Be careful out there. Unintended consequences wait everywhere in the tall grass.
