Scalability

Open Stories and Issues

Deal with Supervision properly: I have been putting off worrying much about Supervision until we move to a scalable architecture, but once we do, there's no excuse not to get this right.

Dismantle remainder of SpaceManager: We've already moved the main Space message routing out into ClusterSharding, which takes most of the weight off of it. But the remainder should probably get moved into UserSession.

Make the NotificationActor scalable: Currently, it's a singleton, so that clearly must change.

Rewrite AdminUserIdFetcher: This private worker in querki.identity fetches all of the userIds in the system, ultimately for use in Notification. This is clearly broken conceptually.

Closed Stories and Issues

AdminActor should be replaced with workers: This was originally "AdminActor should be a Cluster Singleton", but I realized that that's just wrong. There is nothing there that needs to be even long-lived, so the right answer is to just replace it with temporary focused Actors that do what is needed then go away.

As soon as one node begins to load a Space, it should immediately "own" that Space: That is, if we're going to avoid evil race conditions, there probably needs to be a reliable concept of which node a shard lives on.

I should see a manageable subset of Instances: This is obvious from the Comics Space, which is very slow to render the root page and the Title Model. We need to be smarter about trimming that back.

Make the IdentityCache scalable: Currently, this is a singleton. It should not become a Cluster Singleton -- it is hit much too frequently. So we need to figure out how to shard this nicely.

Make the UserCache scalable: Similar to the story for IdentityCache, but a bit trickier.

Move Photo processing into UserSpaceState: Currently, PhotoController still uses withThing(), and passes SpaceState around a bunch. That's no longer legal, so that code needs to get mostly pushed into UserSpaceState, or something like that.

OID management needs to be properly sharded: While the current approach probably still works, it's fragile -- it depends on all the systems getting into transaction contention over one row in one table. We need to do better.
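One common way to reduce contention on a single counter row is to have each node reserve a block of ids at a time and hand them out locally. The sketch below only illustrates that idea and is not necessarily the approach taken here. `CentralCounter`, `BlockAllocator` and `blockSize` are hypothetical names, and an in-memory `AtomicLong` stands in for the database row.

```scala
import java.util.concurrent.atomic.AtomicLong

object OidBlockSketch {
  // Hypothetical stand-in for the shared counter. In a real system this would
  // be the single database row, updated inside a transaction.
  final class CentralCounter(start: Long) {
    private val next = new AtomicLong(start)

    // Reserve `size` consecutive ids in one step and return the first of them.
    def reserve(size: Int): Long = next.getAndAdd(size.toLong)
  }

  // Hypothetical per-node allocator. It touches the central counter only when
  // its local block is used up, so the shared row is hit once per `blockSize`
  // ids instead of once per id.
  final class BlockAllocator(central: CentralCounter, blockSize: Int) {
    private var current = 0L
    private var limit = 0L // exclusive upper bound; current == limit means "empty"

    def nextOid(): Long = synchronized {
      if (current == limit) {
        current = central.reserve(blockSize)
        limit = current + blockSize
      }
      val oid = current
      current += 1
      oid
    }
  }
}
```

The trade-off is that ids are no longer handed out in strict global order, and a node that stops leaves the unused part of its block as a gap.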

Remove withSpace, withThing, etc: One of the hard rules of the new architecture is Thou Shalt Not Pass SpaceState Between Nodes. So it's time to remove a lot of old UI code.

Sanity-check the PhotoUploadManager: This may be a non-issue: at a quick glance, it looks like having a separate copy per node is likely fine.

Shard the SpaceManager: I suspect that the SpaceManager needs to be split, with the routing becoming just Cluster Sharding and other functions becoming a separate Actor or Ecot. (Is there anything here that can't be run in a UserSession or UserSpaceSession?)

Shard the UserSessionManager: In principle, this seems like a straightforward use of Cluster Sharding.
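Cluster Sharding routes each message using an entity id and a shard id derived from the message. A minimal sketch of the shard id half, assuming sessions are keyed by a string user id: `shardIdFor` and the shard count of 30 are hypothetical choices for illustration only.

```scala
object ShardIdSketch {
  // Assumed shard count, chosen only for this example.
  val numberOfShards = 30

  // Deterministic mapping: the same user id always lands in the same shard.
  // floorMod keeps the result in 0 until numberOfShards even when hashCode
  // is negative.
  def shardIdFor(userId: String): String =
    Math.floorMod(userId.hashCode, numberOfShards).toString
}
```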

Space.loadSpace is blocking, and must not be: Not technically blocking clustering, but this is a pretty evil bug -- it borders on Critical simply because it's a ticking time bomb of bad. Space.loadSpace currently blocks on Load, and Load can take quite a while, especially if we need to evolve the space. That's an absolute no-no.
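The general shape of a non-blocking load can be sketched with plain Scala Futures. Everything named here (`LoadedSpace`, `load`, `evolve`, and this version of `loadSpace`) is a hypothetical stand-in, not the actual Querki code. The point is only that the load and evolution steps are composed rather than awaited.

```scala
import scala.concurrent.{ExecutionContext, Future}

object NonBlockingLoadSketch {
  // Hypothetical stand-in for the loaded state of a Space.
  final case class LoadedSpace(name: String, version: Int)

  // Hypothetical slow load. The work runs on the supplied ExecutionContext,
  // not on the caller's thread.
  def load(name: String)(implicit ec: ExecutionContext): Future[LoadedSpace] =
    Future(LoadedSpace(name, version = 1))

  // Hypothetical evolution step, also asynchronous.
  def evolve(space: LoadedSpace)(implicit ec: ExecutionContext): Future[LoadedSpace] =
    Future(space.copy(version = space.version + 1))

  // Non-blocking composition: this returns immediately with a Future. The
  // caller is never parked waiting for the load or the evolution to finish.
  def loadSpace(name: String)(implicit ec: ExecutionContext): Future[LoadedSpace] =
    load(name).flatMap(space => evolve(space))
}
```

In an Akka actor, the same idea is usually expressed by piping the Future's result back to the actor as a message, so the actor keeps processing its mailbox while the load is in flight.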

While a Space is undergoing Evolution, it should be locked: Currently, there is a rare but terribly dangerous race condition in querki.evolutions.Evolutions. This is a process that runs a relatively long time, and I have no confidence that, if it starts on two machines simultaneously, it won't produce horrible brokenness.
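The acquire/release protocol such a lock needs can be sketched as follows. This is an illustration under loud assumptions: `EvolutionLocks` and `withEvolutionLock` are hypothetical names, and the in-memory map only works within one JVM. A real cluster-wide lock would have to live somewhere every node can see, such as the database.

```scala
import java.util.concurrent.ConcurrentHashMap

object EvolutionLockSketch {
  // Hypothetical lock table keyed by Space id, recording which node owns the
  // lock. In-memory only: this demonstrates the protocol, not the storage.
  final class EvolutionLocks {
    private val owners = new ConcurrentHashMap[String, String]()

    // Atomically claim the Space for `node`. putIfAbsent returns null only
    // when there was no previous owner, so exactly one caller can win.
    def tryAcquire(spaceId: String, node: String): Boolean =
      owners.putIfAbsent(spaceId, node) == null

    // Release only if `node` is still the recorded owner.
    def release(spaceId: String, node: String): Boolean =
      owners.remove(spaceId, node)
  }

  // Run `evolve` only if the lock was obtained, and always release afterwards.
  // Returns None when another node already holds the lock.
  def withEvolutionLock[A](locks: EvolutionLocks, spaceId: String, node: String)(evolve: => A): Option[A] =
    if (locks.tryAcquire(spaceId, node)) {
      try Some(evolve)
      finally locks.release(spaceId, node)
    } else None
}
```

A caller that gets `None` back knows another node is already evolving the Space and can retry later instead of starting a second, concurrent Evolution.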