Introduce a concept of multi-History
Summary: This is a follow-on to "Playtesting Space is hanging frequently", to address what seems to be the current problem there. It is enormous, currently hypothetical, and might turn out to be unnecessary, but let's start seriously thinking about it now.
From what I can see of "Playtesting Space is hanging frequently", my current best guess is that the problem relates specifically to the size of the history. In particular, the symptoms smell like, when we start loading a Space with an enormous History, it takes a very long time to scan to the start of the events we need to read in. (On the order of 60 seconds for the SI Playtesting Space, which has around a hundred thousand history records.) That is causing timeouts. I have hopefully patched that problem by extending those timeouts, but that's a hack: we need a longer-term solution.
Note: there is a non-trivial chance that this problem will simply go away once we can upgrade akka-persistence-cassandra to a decently modern version, so let's hold off on tackling this ticket until after we do that.
At the airy, hand-wavy level, a possible fix is to automatically split the history of a large Space. There is nothing sacred saying that a Space can only have one persistence stream. So what about saying that, after e.g. 10k history records, we consider the history to date to be archived and start a new one? The Space's persistence ID would get a "kicker" number; we would open a new Actor for it, take a snapshot, drop that snapshot in to initialize the new Actor, and go from there.
We would record the current kicker in the main SQL database, so that we know which Actor to actually load.
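As a purely illustrative sketch of the "kicker" scheme (the id format and function name here are my assumptions, not settled design), the derived persistenceId might look something like:

```scala
// Illustrative only: the real id format is undecided.
// Kicker 0 keeps the legacy persistenceId, so existing Spaces load unchanged;
// later kickers get a suffixed id for their own persistence stream.
def segmentPersistenceId(spaceId: String, kicker: Int): String =
  if (kicker == 0) spaceId
  else s"$spaceId-seg$kicker"
```

The kicker stored in SQL would then tell us exactly which of these ids to hand to the Actor at load time.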
When doing History operations, we would need to introduce some coordination, but conceptually it isn't hard, since those loads are entirely under program control -- we would just stitch together the streams into a single logical stream.
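To make the stitching idea concrete, here is a hedged sketch (types and names are invented for illustration): since each segment after the first opens with a redundant SetState snapshot, concatenating the logical stream mostly means dropping those leading snapshots.

```scala
// Hypothetical event types, standing in for the real history records.
sealed trait HistoryEvt
case class SetState(state: String) extends HistoryEvt
case class Change(delta: String) extends HistoryEvt

// Stitch per-segment histories into one logical stream: keep the first
// segment whole, and drop the leading SetState from each later segment,
// since it just replays state the previous segment already produced.
def stitch(segments: Seq[Seq[HistoryEvt]]): Seq[HistoryEvt] =
  segments.zipWithIndex.flatMap {
    case (seg, 0) => seg
    case (seg, _) => seg.dropWhile(_.isInstanceOf[SetState])
  }
```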
This needs lots more thought and design, but in principle there is no good reason it can't work. (Right?)
Plan
Okay, let's take this seriously, and figure out what needs to be done. The more I think about it, the less confidence I have that the upgrade is going to save us.
The goal is easy to summarize: a given Space is no longer composed of a single persistenceId, but of a sequence of "segments", monotonically increasing. Each segment begins with a SetState
of the current state, and runs for around N events. (Where N is probably around 10k.) Each segment ends with a new SegmentFinished
event, which tells the system to deterministically go on to the next. If a SpaceCore
is in the finished state, and receives a change event, it should forward that to the next segment.
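The rollover rule described above can be sketched as pure logic (names, the shape of the state, and N are all assumptions for illustration, not the real SpaceCore implementation):

```scala
// Hypothetical per-segment bookkeeping: which kicker this is, how many
// events it has persisted, and whether it has emitted SegmentFinished.
case class SegmentState(kicker: Int, eventCount: Int, finished: Boolean)

val N = 10000 // approximate events per segment; exact value TBD

// Returns the updated state, plus whether this change event must be
// forwarded to the next segment because this one is already finished.
def onChangeEvent(s: SegmentState): (SegmentState, Boolean) =
  if (s.finished) (s, true) // finished segment: forward to kicker + 1
  else {
    val bumped = s.copy(eventCount = s.eventCount + 1)
    if (bumped.eventCount >= N)
      // time to persist SegmentFinished and hand off to the next segment
      (bumped.copy(finished = true), false)
    else (bumped, false)
  }
```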
All of this should be automatic, and largely hidden from users. Once fully implemented, it will become the normal mode of operation from here on.