Scalability
Open Stories and Issues
Deal with Supervision properly: I have deliberately not worried much about Supervision while we were still moving to a scalable architecture, but once we have moved, there's no excuse not to get this right.
Dismantle remainder of SpaceManager: We've already moved the main Space message routing out into ClusterSharding, which takes most of the weight off of it. But the remainder should probably get moved into UserSession.
Rewrite AdminUserIdFetcher: This private worker in querki.identity fetches all of the userIds in the system, ultimately for use in Notification. This is clearly broken conceptually.
Closed Stories and Issues
AdminActor should be replaced with workers: This was originally "AdminActor should be a Cluster Singleton", but I realized that that's just wrong. There is nothing there that needs to be even long-lived, so the right answer is to just replace it with temporary focused Actors that do what is needed then go away.
Make the IdentityCache scalable: Currently, this is a singleton. It should not become a Cluster Singleton -- it is hit much too frequently. So we need to figure out how to shard this nicely.
Move Photo processing into UserSpaceState: Currently, PhotoController still uses withThing(), and passes SpaceState around a bunch. That's no longer legal, so that code needs to get mostly pushed into UserSpaceState, or something like that.
OID management needs to be properly sharded: While the current approach probably still works, it's fragile -- it depends on all the systems getting into transaction contention over one row in one table. We need to do better.
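As an illustration of why the single contended row hurts, and of one way around it, here is a minimal sketch of block reservation: each node grabs a whole range of ids in one transaction and then hands them out locally. This is only one possible approach, not a description of what Querki actually does. `SharedCounter` is a hypothetical in-memory stand-in for the one database row, and `OidAllocator` and `blockSize` are names invented for this example.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

// Hypothetical stand-in for the single shared row holding the next free id.
// In a real system, reserve() would be one short database transaction.
final class SharedCounter(start: Long) {
  private val next = new AtomicLong(start)
  private val count = new AtomicInteger(0)

  /** Atomically reserves `size` consecutive ids and returns the first one. */
  def reserve(size: Int): Long = {
    count.incrementAndGet()
    next.getAndAdd(size.toLong)
  }

  /** How many times any node has touched the shared counter. */
  def reservations: Int = count.get()
}

// One instance per node. It only touches the shared counter when its
// current block is used up, so contention drops by a factor of blockSize.
final class OidAllocator(shared: SharedCounter, blockSize: Int) {
  private var current = 0L
  private var limit = 0L // exclusive end of the block; current == limit means exhausted

  def nextOid(): Long = synchronized {
    if (current == limit) {
      current = shared.reserve(blockSize)
      limit = current + blockSize
    }
    val oid = current
    current += 1
    oid
  }
}
```

The trade-off is that ids reserved by a node that crashes are simply lost, so the id space develops gaps. That is usually acceptable when ids only need to be unique, not dense.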
Remove withSpace, withThing, etc: One of the hard rules of the new architecture is Thou Shalt Not Pass SpaceState Between Nodes. So it's time to remove a lot of old UI code.
Shard the SpaceManager: I suspect that the SpaceManager needs to be split, with the routing becoming just Cluster Sharding and other functions becoming a separate Actor or Ecot. (Is there anything here that can't be run in a UserSession or UserSpaceSession?)
Space.loadSpace is blocking, and must not be: This doesn't technically block the clustering work, but it is a pretty evil bug -- it borders on Critical simply because it's a ticking time bomb of bad. Space.loadSpace currently blocks on Load, and Load can take quite a while, especially if we need to evolve the Space. That's an absolute no-no.
While a Space is undergoing Evolution, it should be locked: Currently, there is a rare but terribly dangerous race condition in querki.evolutions.Evolutions. This is a process that runs for a relatively long time, and I have no confidence that, if it starts on two machines simultaneously, it won't produce horrible brokenness.
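The locking idea for Evolution can be sketched as a "first node to claim the Space wins" guard. This is a minimal illustration under stated assumptions, not the actual fix. `EvolutionLocks` is a hypothetical in-memory stand-in for something every node can see (for example, a database table with a unique key on the Space id), and `Evolver`, `evolveExclusively`, and the string ids are invented for the example.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for a lock table shared by all nodes.
// putIfAbsent plays the role of an insert that fails on a duplicate key.
final class EvolutionLocks {
  private val holders = new ConcurrentHashMap[String, String]()

  /** True if `node` now holds the lock for `spaceId`; false if someone else holds it. */
  def tryLock(spaceId: String, node: String): Boolean =
    holders.putIfAbsent(spaceId, node) == null

  /** Releases the lock, but only if `node` is the current holder. */
  def unlock(spaceId: String, node: String): Unit = {
    holders.remove(spaceId, node)
    ()
  }
}

object Evolver {
  /** Runs `evolve` only if this node wins the lock; returns None if another node holds it. */
  def evolveExclusively[A](locks: EvolutionLocks, spaceId: String, node: String)(evolve: => A): Option[A] =
    if (locks.tryLock(spaceId, node)) {
      try Some(evolve)
      finally locks.unlock(spaceId, node) // always release, even if evolve throws
    } else None
}
```

A caller that gets `None` back would wait and retry the load rather than evolve the Space itself. A real version would also need some way to expire a lock held by a node that died mid-Evolution, which this sketch leaves out.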