Internals

Open Stories and Issues

As a Querki Dev, I can write asynchronous functions: Of course, we do this already, but it's horrible. This story is to do it right.

As a Querki programmer, I can enhance QText more easily: This is the motivating story to rewrite the QText parser from scratch, probably in FastParse.

Change internal HTML data chain: Purely internal note to myself, for dependency tracking: change the way we pass HTML around in the system.

Change the Akka serialization protocol: We're currently using Java serialization for inter-node communication (we're using Kryo for persistence); by all accounts Java serialization is a bad choice for this.

Consider getting rid of the SpaceMembership table: This is one of the "fattest" tables in the system, being Identity x Spaces. Can we get rid of it in the new world?

Conversations should time out: Currently, once we load a Conversation into the SpaceConversationsActor, it stays there until the Space itself times out. That's probably unwise: we should have a separate timer for each Conversation.

CurrentState() messages are probably causing OldUserSpaceSession to remain alive: This isn't critical until people are using the Arisia Volunteers Space hard, but has the potential to crash the server during Arisia itself, so it must be fixed!

DB Actors should be using an appropriate Dispatcher: Right now, all Actors are using the standard Dispatcher. This is known to be bad; the MySQL-oriented ones should use a PinnedDispatcher instead.

DB Persistence errors are *too* quiet: The fix to "Changing an object's state in Space.modifyThing() isn't sufficiently atomic" has the side effect that the UI no longer waits for DB modify operations to complete. This is generally an improvement, but it means that, in the dreadful case that the DB update fails for some reason, the failure is rather invisible to the user.

Email Address should not exist on Person: Currently, the Person record (the view of an Identity in a Space) contains that Identity's email address. We are going to significant effort to hide that email address from end users. But it really shouldn't be there in the first place.

Get Profiling working again: Much of profiling got scragged by the rewrite of the QL Pipeline to use Futures. It needs a deep rethink in general.

Improved way to create RequestM: At Alex's talk on Monix Tasks, he described how Monix Task.create works. Instead of the two-phase approach I used for RequestM, based on Promise, he passes a function to create, which is the stuff between prep and returning. This is clearly a better way, so replace prep with this!
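To make the shape of that idea concrete, here is a minimal sketch using only the standard library's Future and Promise. It is not RequestM and not Monix: the names create, Callback, and legacyLookup are hypothetical, chosen purely for illustration. The point is that the caller passes in the "register a callback" step, and the Promise never leaks out.

```scala
import scala.concurrent.{Future, Promise}
import scala.util.{Success, Try}

object CreateSketch {
  // Hypothetical callback type: invoked with the outcome of the operation.
  type Callback[T] = Try[T] => Unit

  // Hypothetical single-phase constructor: the caller supplies only the
  // registration step; the Promise stays private to create.
  def create[T](register: Callback[T] => Unit): Future[T] = {
    val p = Promise[T]()
    // tryComplete means a second invocation of the callback is ignored
    // instead of throwing.
    register(result => { p.tryComplete(result); () })
    p.future
  }

  // Hypothetical callback-style API standing in for the code being wrapped.
  def legacyLookup(key: String)(cb: Callback[Int]): Unit =
    cb(Success(key.length))

  // Usage: no Promise is visible at the call site.
  val lookedUp: Future[Int] = create[Int](cb => legacyLookup("querki")(cb))
}
```

Compared to a two-phase prep-then-complete style, there is no window in which the caller holds a half-built value.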

In the Client, DataSetting.setThing() is deeply suspicious: It's side-effecty; worse, InputGadget.thingId usually relies on it. It's a nasty code smell, and I've already encountered serious bugs caused by it. We should instead be rigorously setting data-thing explicitly when needed.

Introduce a concept of multi-History: This is a follow-on to "Playtesting Space is hanging frequently", to address what seems to be the current problem there. It is enormous, currently hypothetical, and might turn out to be unnecessary, but let's start seriously thinking about it now.

Introduce BulkSpaceEvent: At the moment, we're doing things like SecuritySpacePlugin as multiple Events. Conceptually, that's wrong -- they should really be one aggregated Event.

Investigate using Lenses for Space manipulation: The Space code has a bunch of slightly-messy structure-rewrites. This is exactly what Lenses are good for, so we should consider using them.
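As a rough illustration of what that might buy us, here is a minimal hand-rolled Lens. This is a sketch only: Thing and State are heavily simplified, hypothetical stand-ins for the real Space structures, and a real implementation would more likely use an existing lens library than this toy.

```scala
object LensSketch {
  // Minimal lens: a getter plus a copying setter.
  final case class Lens[S, A](get: S => A, set: (S, A) => S) {
    def modify(s: S)(f: A => A): S = set(s, f(get(s)))

    // Compose with a lens that looks further inside the focused value.
    def andThen[B](inner: Lens[A, B]): Lens[S, B] =
      Lens[S, B](
        s => inner.get(get(s)),
        (s, b) => set(s, inner.set(get(s), b))
      )
  }

  // Hypothetical, simplified stand-ins for the real structures.
  final case class Thing(name: String, props: Map[String, String])
  final case class State(things: Map[String, Thing])

  // Note: the getter throws if the id is missing; a real version would
  // want an Optional-style lens instead.
  def thingAt(id: String): Lens[State, Thing] =
    Lens[State, Thing](_.things(id), (s, t) => s.copy(things = s.things.updated(id, t)))

  val props: Lens[Thing, Map[String, String]] =
    Lens[Thing, Map[String, String]](_.props, (t, p) => t.copy(props = p))

  // Usage: one expression replaces a nested chain of copy() calls.
  val before = State(Map("t1" -> Thing("First", Map("color" -> "red"))))
  val after = thingAt("t1").andThen(props).modify(before)(_.updated("color", "blue"))
}
```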

Invocation should be based on the State Monad: This is to fix the "elem" problem -- the fact that you have to manually remember, after calling something like contextBundlesAndElements, to use the elem later in the for comprehension. If you goof, you get multiplicative horror. We need this to be automatic.

mainIdentity is a fundamentally broken idea: We're using the convenient concept of mainIdentity in more and more places in the code. But it really doesn't make sense -- if I have a FB login and a direct Querki login, which one is "main"? I suspect we're getting into trouble here.

Notification Rendering no longer works appropriately: This design bug isn't user-visible -- yet -- but shows up as a horrible hack in the Client's NotificationsPage. CommentNotifier shows the bug clearly. The issue is that we're rendering the link to the Comment deep in the server, inside CommentNotifier, but we really should be doing so client-side.

Notifications are still on the old persistence system: This isn't actually breaking anything, but it's a very bad smell. And now that I'm adding userland ability to send notifications, it's going to become more problematic.

Optimize the Read permission recalculation when a Space changes: This is currently wildly inefficient, and while I try to avoid premature optimization, it seems likely to become a disaster for the Arisia Volunteers Space: with hundreds of people frequently modifying a Space with tens of thousands of records, it's likely to bog down.

Querki should be able to failover to a new region if necessary: Inspired by The Great AWS Outage of September 20th (2015): we should be able to cope if us-east-1 goes down.

Querki should move away from hashbang URLs: It turns out that, even as we were moving towards hashbangs, the rest of the world was moving away from them.

Querki shouldn't have so many tables hanging around: Querki originally simulated Cassandra partitions by using Lots and Lots of MySQL tables. Those are mostly no longer needed, and they slow down DB upgrades a lot.

Race condition during node startup: Potentially, Play can start receiving and processing requests for this node before it has joined the Cluster; badness will presumably result.

Remove Deprecated invite / login / identity code: The invitation system has massively changed. After a brief deprecation cycle, remove the dead code.

RequestContext should be passed more consistently: We pass the RC a lot, enough to probably standardize on it. Doing so would have huge advantages, not least being able to use it as a thread of control for logging. If QLog took an implicit RC, then it could basically preface spews with a request ID.
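A minimal sketch of the implicit-RC logging idea, under heavy assumptions: the RequestContext here carries nothing but a requestId field, and QLogSketch is a hypothetical stand-in, not the real QLog.

```scala
object LogSketch {
  // Hypothetical, pared-down RequestContext carrying only a request ID.
  final case class RequestContext(requestId: String)

  object QLogSketch {
    // Pure formatting step, kept separate so it is easy to check.
    def format(msg: String)(implicit rc: RequestContext): String =
      s"[${rc.requestId}] $msg"

    // Every spew is prefaced with the ID of the request that caused it.
    def spew(msg: String)(implicit rc: RequestContext): Unit =
      println(format(msg))
  }

  // Any function that already takes the RC implicitly gets tagged logging
  // without passing anything extra.
  def handle()(implicit rc: RequestContext): String = {
    QLogSketch.spew("loading Space")
    QLogSketch.format("done")
  }
}
```

The cost is that every function on the call path needs the implicit parameter, which is exactly the standardization this item is asking for.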

Rewrite Requester to be tell-based rather than ask-based?: This post points out the fragility of using "ask" too much. I suspect it'll eventually become an issue for us -- while we don't often use ask explicitly, we do use it under the hood of Requester, and we use that a lot. Also, we use ask to go from the Play layer to the middle layer.

Rewrite the internal QL pipeline: This one is still deeply hypothetical, and huge, but the potential performance gains are sufficiently huge as to be worth really digging down and dealing with it.

Space Persisters need their own thread pool: This hasn't proven to be a problem yet, but probably will be. At the moment, we allow an arbitrary number of Space Persisters to try to write simultaneously. That's fine unless DB communication locks up completely; if it does, the Space Persisters will eventually starve the global thread pool.

The Apps table in the DB is apparently vestigial, and should be deleted: It's not doing any harm, but it's cluttering and confusing.

The Client should be able to go Space-to-Space without reloading: Currently, whenever I go from a Space to the index, and thence to another Space, it reloads the entire Client twice. This is annoying and slow; we should be able to cope with switching within a single Client.

The internal QL dataflow pipeline should probably be stream-oriented: Right now, everything is eagerly evaluated, which is conceptually dumb -- it should, in principle, be lazy.
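The eager-versus-lazy difference can be shown in miniature with a plain Iterator. This is only an illustration of the principle, not a proposal for the actual pipeline design; render is a hypothetical stand-in for an expensive QL stage.

```scala
object LazySketch {
  // Counts how many times the expensive stage actually runs.
  var evaluated = 0

  // Hypothetical expensive per-element stage.
  def render(n: Int): String = { evaluated += 1; s"item-$n" }

  val source: List[Int] = (1 to 1000).toList

  // An eager pipeline, source.map(render).take(3), would call render for
  // all 1000 elements. Going through an Iterator pulls elements on demand,
  // so render runs only for the elements that are actually consumed.
  val firstThree: List[String] = source.iterator.map(render).take(3).toList
}
```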

User Values should be properly Persisted: This is basically a marker so we don't lose track of this half-completed Epic, which I started and then realized wasn't critical-path yet.

UserPersistence should all be asynchronous: Pretty much all functions here are MySQL calls, and thus relatively slow. They should all be getting performed on a separate, dedicated MySQL threadpool, and should all be returning Futures. (Or better yet, IO.)

UserPersistence should be Future-centric: Currently, most of the calls in UserPersistence are synchronous. That's insane. Rewrite them to be async instead.
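For the two UserPersistence items above, the general shape is probably something like the following sketch: a dedicated, bounded pool for blocking DB work, with every call returning a Future. The pool size of 4 and the names dbContext, blockingLoadName, and loadName are assumptions for illustration, not anything in the current code.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object DbPoolSketch {
  // Assumption: 4 threads; the right size depends on the DB connection pool.
  private val pool = Executors.newFixedThreadPool(4)

  // Dedicated context, so blocking DB calls cannot starve the global pool.
  val dbContext: ExecutionContext = ExecutionContext.fromExecutorService(pool)

  // Hypothetical blocking lookup standing in for a real MySQL call.
  private def blockingLoadName(userId: Long): String = s"user-$userId"

  // Async wrapper: callers get a Future, and the blocking work runs on
  // dbContext rather than on the caller's thread.
  def loadName(userId: Long): Future[String] =
    Future(blockingLoadName(userId))(dbContext)

  // The pool uses non-daemon threads, so it must be shut down explicitly.
  def shutdown(): Unit = pool.shutdown()
}
```

Bounding the pool also caps the number of simultaneous DB writers, which is the same concern raised for the Space Persisters.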

Closed Stories and Issues

Changing an object's state in Space.modifyThing() isn't sufficiently atomic: There is a race condition, deep inside the Space code. It isn't huge, but there is probably a fair fraction of a second wherein, if we get two change requests for a given Thing, one of them may be lost.

Creating a Space often gives an Unexpected Error: Two times out of three, when you try to create a Space in the production environment, you get no response, and eventually get an Unexpected Error, even though the Space has been created successfully.

editThingInternal() needs to stop using raw OIDs: In particular, the "model" parameter is a raw OID; it should have always been a ThingId. Fixing this is going to be a PITA, since we have screwed it up in so many places, but it needs to happen -- at the moment, it is causing all sorts of bugs.

Excessively large pages fail to return properly: This is purely a cluster-serialization problem, and I should have thought of it before -- if the wikitext for a page is more than 128k, it doesn't get properly serialized and sent.

Figure out a decently appropriate way to cache calculated info on the SpaceState: There are lots of subsystems that calculate piles of data about the SpaceState. We desperately need a way to cache the results in the State.

Going to certain pages when Space is not yet loaded fails: Not sure what's going on, but this shows up in my new Sandbox. It always recovers, but it's worrying.

I can export a Space, and re-import it into another Querki Instance: As the Owner, I should be able to get, eg, a faithful JSON representation of the entire Space. I should be able to turn that into a full copy of that Space later.

I should be able to use Functions in a natural OO way: See the Other Details, but the high concept is that I should be able to use Functions as Methods properly, including overriding them in sub-Models.

I should not need an Identity record for a Trivial Identity: Currently, we have to have a DB record for all Identities, even the completely empty ones created from Shared Links. That's adding bulk to the MySQL table for no good reason.

If an Exception is thrown during Permission Checking, we get an RSOD: That is, if I am not the owner of the Space, and for some reason the attempt to filter out what I can read throws an Exception, I can't load the Space at all.

If I leave off the final slash at the end of a Space's URL, it should work: Because it's an obvious mistake to make, and there's no good reason it doesn't work.

If you delete a Model that has Instances, the Space won't load any more: Almost the definition of a P1 bug: if you delete a Model that has Instances, things start to act a bit wonky. But once the Space passivates, you can't reload it -- it crashes on reload.

It seems to be possible to get split-brain: I've only seen this once, for this Issue Tracking Space, and don't understand the details. But there were clearly two shards in existence for this Space.

QL processing needs to become asynchronous: This is big and ugly, but will become necessary. QL processing currently happens synchronously, but that's going to become a major scaling problem eventually.

The SpaceMembership table should include the Owner!: The fact that it doesn't is causing some unnecessarily complex and expensive DB code.

When we reload the events for a Space, we aren't incrementing the snapshotCounter: This is apparently why some Spaces are taking a very long time to load: unless we cross the snapshotCounter threshold (100 changes) in a single load of the Space, the Events keep building up without a Snapshot.
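The counting problem can be sketched as follows. This is a toy model, not the real persistence code: SnapshotTracker, recovered, and persisted are hypothetical names, and the only detail taken from the issue is the threshold of 100 changes.

```scala
object SnapshotSketch {
  final class SnapshotTracker(threshold: Int) {
    private var snapshotCounter = 0

    // Called after load with the number of events replayed since the last
    // Snapshot. Skipping this step is the bug: the counter would restart
    // at zero on every load.
    def recovered(replayedEvents: Int): Unit =
      snapshotCounter += replayedEvents

    // Called for each newly persisted change; returns true when a Snapshot
    // should be taken, and resets the counter when it is.
    def persisted(): Boolean = {
      snapshotCounter += 1
      if (snapshotCounter >= threshold) { snapshotCounter = 0; true }
      else false
    }
  }
}
```

With the replayed events counted, a Space that loads with 60 unsnapshotted events needs only 40 more changes to trigger a Snapshot, rather than a fresh 100 within a single session.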