OID management needs to be properly sharded
Summary: While the current approach probably still works, it's fragile -- it depends on every node contending, transactionally, over a single row in a single table. We need to do better.
This one's non-trivial. We probably need to change the OID table to have one row per shard. Each node has an OIDAllocator Actor, which sits on top of one shard, so that the nodes can do their allocation in parallel. There should be a master OIDManager ClusterSingleton, which is in charge of assigning each node a shard at startup time.
Issue: how do we deal with the OIDManager moving? When the new singleton comes to life, it needs to know which shard is assigned to which node, with a high degree of accuracy. It can err on the side of being too conservative, assuming that an allocation exists when that node has actually gone down, but this needs to eventually heal. Assume that the system is always up, and that the Manager moves frequently -- how does the system deal with nodes going down and their shards eventually getting reassigned?
This eventually needs some very sophisticated unit testing.
Design
Okay, here's the hypothetical design, which I think is roughly optimal.
The key point is that we should give up trying to be quite so efficient about shard allocation. Provided we are willing to allow some shards to be inactive some of the time, many of the synchronization problems become easier.
All of the key components -- the OIDManager and the OIDAllocators -- should be Persistent Actors. The OIDManager is a persistent Cluster Singleton (so movement becomes a non-problem), whose job is to keep track of which Shards are currently in use. It has a map from Nodes to Shard IDs, a list of Shards that are "full" (which starts out at 3, to avoid the System, Test and old-fashioned Production Shards), and a Long high-water mark for the full shards (to reduce wear and tear on the list; not strictly necessary to begin with).
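To make that concrete, here's a rough sketch of what the Manager's persistent state might look like. All of the names are hypothetical, and I'm reading the high-water mark as "every Shard at or below it is full":

    import akka.actor.ActorPath

    /**
     * Hypothetical shape of the OIDManager's persistent state; names are
     * illustrative, not settled.
     *
     * @param assignments   which Shard each live Node's allocator currently holds
     * @param fullShards    individually-recorded full Shards above the high-water mark
     * @param fullHighWater every Shard at or below this mark is full, so the first
     *                      three (System, Test, old-style Production) never need to
     *                      appear in fullShards explicitly
     */
    case class ManagerState(
      assignments: Map[ActorPath, Long] = Map.empty,
      fullShards: Set[Long] = Set.empty,
      fullHighWater: Long = 3L
    )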
Each Node has a single OIDAllocator; this is a bottleneck, but likely not a serious one. When the Node starts up, its OIDAllocator contacts the OIDManager and asks for a Shard ID -- a long int. It uses this Shard ID as the identity for its persistence. (This may indicate that we need a pair of Actors: a parent that requests the ID and then the actual OIDAllocator.) This process should probably be a gate during system startup: full initialization shouldn't be considered complete until the OIDAllocator is running. (This implies enhancements to QuerkiRoot and the server-side initialization process. The easiest way to do this is probably to enhance the System ecot so that subsystems can say, "I'm initializing asynchronously; don't start until I give the okay".)
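Something like this parent/child pair is what I have in mind -- purely a sketch, with made-up message names, a ref (or singleton proxy) to the Manager passed in, and a stub standing in for the real allocator:

    import akka.actor.{Actor, ActorRef, Props}

    // Hypothetical message names, just for illustration.
    case object RequestShard
    case class ShardAssigned(shardId: Long)

    // Stand-in for the real allocator, which is sketched further down.
    class OIDAllocator(shardId: Long) extends Actor {
      def receive: Receive = Actor.emptyBehavior
    }
    object OIDAllocator {
      def props(shardId: Long): Props = Props(new OIDAllocator(shardId))
    }

    /**
     * The parent asks the OIDManager for a Shard ID, and only builds the real
     * OIDAllocator -- whose persistence identity is derived from that ID --
     * once the assignment arrives.
     */
    class OIDAllocationParent(manager: ActorRef) extends Actor {
      override def preStart(): Unit = manager ! RequestShard

      def receive: Receive = {
        case ShardAssigned(shardId) =>
          val allocator = context.actorOf(OIDAllocator.props(shardId), s"allocator-$shardId")
          context.become(running(allocator))
      }

      def running(allocator: ActorRef): Receive = {
        // Once we're live, everything else is an allocation request: hand it off.
        case msg => allocator.forward(msg)
      }
    }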
When the OIDAllocator asks the OIDManager for a Shard ID, the Manager looks for the lowest number that is in neither the list of full shards nor the current active mapping, records that new assignment, and returns it.
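As a sketch (again assuming the high-water mark means "everything at or below it is full"), that rule is just a linear scan upward:

    object ShardAssignment {
      /**
       * The lowest Shard number that is neither full nor currently assigned.
       */
      def lowestFreeShard(assigned: Set[Long], fullShards: Set[Long], fullHighWater: Long): Long =
        Iterator.iterate(fullHighWater + 1)(_ + 1)
          .find(shard => !assigned.contains(shard) && !fullShards.contains(shard))
          .get

      // e.g. lowestFreeShard(Set(4L, 5L), Set(7L), 3L) == 6
    }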
The OIDAllocator could have a slow heartbeat to the server -- we don't need to be too aggressive about this, since we're willing to let the Shard namespace be a little sparse. Say the OIDAllocator pings the OIDManager once a minute, and the OIDManager considers an OIDAllocator dead if it hasn't gotten a heartbeat in fifteen minutes (all configurable, of course). But no: the more-idiomatic way to do it is for the QuerkiNodeCoordinator to simply put a watch on the requesting QuerkiNodeManager -- no need to reinvent that wheel. This does imply that the stored Event needs to record the requesting Manager's path, so that the Coordinator can find it again during recovery.
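Here's a rough fragment of what the watch-based version might look like on the Coordinator side. Message, event and class names are made up, and the real thing would live in the existing QuerkiNodeCoordinator; the point is recording the requester's path in the persisted event and re-establishing the watch after recovery:

    import akka.actor.{ActorIdentity, ActorPath, Identify, Terminated}
    import akka.persistence.{PersistentActor, RecoveryCompleted}

    // Hypothetical messages/events, for illustration only.
    case object RequestShard
    case class ShardAssigned(shardId: Long)
    case class AssignedEvt(requesterPath: ActorPath, shardId: Long)
    case class ReleasedEvt(shardId: Long)

    class ShardCoordinator extends PersistentActor {
      override def persistenceId = "shard-coordinator"

      var assignments = Map.empty[ActorPath, Long]

      // Same rule as sketched earlier: lowest Shard above the three reserved
      // ones that isn't currently assigned. (Full-Shard tracking elided here.)
      private def lowestFreeShard: Long =
        Iterator.iterate(4L)(_ + 1).find(s => !assignments.values.exists(_ == s)).get

      val receiveCommand: Receive = {
        case RequestShard =>
          val requester = sender()
          persist(AssignedEvt(requester.path, lowestFreeShard)) { evt =>
            assignments += (evt.requesterPath -> evt.shardId)
            context.watch(requester)
            requester ! ShardAssigned(evt.shardId)
          }

        case Terminated(ref) =>
          // The watched node went away; heal by releasing its Shard.
          assignments.get(ref.path).foreach { shardId =>
            persist(ReleasedEvt(shardId)) { _ => assignments -= ref.path }
          }

        case ActorIdentity(_, Some(ref)) =>
          // A previously-recorded requester is still alive after our recovery: re-watch it.
          context.watch(ref)

        case ActorIdentity(path: ActorPath, None) =>
          // The requester is already gone; release its Shard so the mapping heals.
          assignments.get(path).foreach { shardId =>
            persist(ReleasedEvt(shardId)) { _ => assignments -= path }
          }
      }

      val receiveRecover: Receive = {
        case AssignedEvt(path, shardId) => assignments += (path -> shardId)
        case ReleasedEvt(shardId) => assignments = assignments.filterNot { case (_, id) => id == shardId }
        case RecoveryCompleted =>
          // After replay, ask each recorded path to identify itself so we can re-watch.
          assignments.keys.foreach(path => context.actorSelection(path) ! Identify(path))
      }
    }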
Finally, OID allocation becomes a near-trivial persistent operation: simply increment this Shard's counter (remember, it's a persistent Actor) and return that. It is especially important that the OIDAllocators have a Snapshot scheme, which is nicely trivial: the snapshot is simply the current counter.
When an OIDAllocator hits some threshold -- say, 2^32 - 10000 or something like that -- it sends a ShardFull message to the OIDManager. The Manager adds that Shard to the Full list, and responds with another Shard for this Allocator to now pick up. (Again, this probably requires some indirection: the OIDAllocator's parent should actually manage this process, shut down the old Allocator and build a new one.)
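Putting the last two paragraphs together, the Allocator might look roughly like this -- hypothetical names again, and the threshold and snapshot interval are placeholders:

    import akka.actor.ActorRef
    import akka.persistence.{PersistentActor, SnapshotOffer}

    // Hypothetical message/event names.
    case object NextOID
    case class NewOID(oid: Long)
    case object Incremented
    case class ShardFull(shardId: Long)

    class OIDAllocator(shardId: Long, manager: ActorRef) extends PersistentActor {
      override def persistenceId = s"oid-allocator-$shardId"

      // Placeholder threshold and snapshot interval -- both would be tuned/configured.
      val fullThreshold: Long = (1L << 32) - 10000
      val snapshotEvery: Long = 10000

      var counter: Long = 0L

      val receiveCommand: Receive = {
        case NextOID =>
          persist(Incremented) { _ =>
            counter += 1
            sender() ! NewOID(counter)
            // The snapshot scheme is trivial: checkpoint the bare counter now and then.
            if (counter % snapshotEvery == 0) saveSnapshot(counter)
            // Hit the top of the Shard: tell the Manager, which will hand our
            // parent a fresh Shard to spin up a replacement allocator.
            if (counter == fullThreshold) manager ! ShardFull(shardId)
          }
      }

      val receiveRecover: Receive = {
        case Incremented => counter += 1
        case SnapshotOffer(_, snap: Long) => counter = snap
      }
    }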
Note: the OIDAllocator Actors might actually want to be command sourced instead of event sourced -- that is, they might use persistAsync(), and immediately increment the counter and respond. If the persisted event is simply Increment, then this converges just fine; it doesn't matter if things get out of order. This would favor low latency (which would be helpful for such a critical message), with a possible danger of quietly overheating the persistence connection. We should probably start with event sourcing, though: it provides better guarantees, and doesn't risk inconsistency if persistence goes down.
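For comparison, the command-sourced flavor might look like this -- same caveats about made-up names, and note that replying before the persist completes is exactly where the risk lives if persistence goes down:

    import akka.persistence.PersistentActor

    // Same made-up message names as above; the persisted event is just Increment.
    case object NextOID
    case class NewOID(oid: Long)
    case object Increment

    class CommandSourcedAllocator(shardId: Long) extends PersistentActor {
      override def persistenceId = s"oid-allocator-$shardId"

      var counter: Long = 0L

      val receiveCommand: Receive = {
        case NextOID =>
          // Low latency: bump and reply immediately...
          counter += 1
          sender() ! NewOID(counter)
          // ...and record the bare Increment in the background. Replaying N
          // Increments converges to the same counter regardless of ordering --
          // but OIDs already handed out are lost if persistence itself fails.
          persistAsync(Increment) { _ => () }
      }

      val receiveRecover: Receive = {
        case Increment => counter += 1
      }
    }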