Rewrite Search

(Incomplete Feature, To be Fixed , Priority: Critical, Test Status: No automated tests yet , Reported By Justin du Coeur, )
Summary: The current implementation of Search is kind of braindead. We can do better.
This started as a bug ticket, but I think it's worth breaking this out into an enhancement instead, since it is going way beyond a mere bugfix.
Currently, Search is utterly brute-force: every time someone tries to search for something, we literally read through every text property on every Thing, looking for that String. In practice, that tends to go decently fast, but in practice it sometimes isn't working, and let's get real -- that goes against every best practice for modern search techniques. Let's rewrite this mess.
The high concept is that we should start by creating and maintaining an index of the Space: go through all of those text properties and build up a map of word usage. When someone does a search, we break it down by word, find all the thing/property pairs where each word is used, and compute a score on each thing/property based on how close it is to the exact phrase.
Important: throughout, we should be indexing and searching on words, not punctuation! I believe there are JVM built-ins for determining whether a character is a word character or not. Also, keep in mind that some punctuation characters create word breaks and others don't -- pay attention to that, and test it carefully.
In general, this needs lots of lots of tests, both unit and scenario. Make sure it does everything we expect.
The gameplan is:
  • Create a new SearchActor, as part of the Space's troupe. This is where the index will live. As with most of the troupe, it should be registered to receive the CurrentState, but it does not initially act on that.
  • Give that an entry point to compute the initial index, based on the CurrentState. This is likely an fs2 pipeline that breaks down the Space and in parallel computes the index pieces. (It's basically a big reduce/merge operation.) If the index already exists, this is a no-op.
  • When the user clicks into the Search box, start that indexing up, to get things going as quickly as possible.
  • Whenever a Space operation happens, update the index accordingly. This is the hardest part to get right: we need to compute the change implied by this operation, and update the index. That requires more than the standard CurrentState -- we need to hook into the SpaceChangeManager, so that all ThingChanges fire off a Search update. If the index has not been computed yet, this is a no-op. (Later: actually, do we need anything else? It looks like CurrentState includes the events -- we're just not using them for much yet.)
  • When an actual Search request comes in, it routes to the SearchActor. This builds the index if it has not already been computed. It decomposes the search request, finds thing/property pairs that contain the words, and computes a score for each based on quality of match.
  • Follow-on: optimize the index. The naive version of this ticket will have an unoptimized record with the Thing OID and Property OID of each match. That's 16 bytes per match, which is a lot of memory. So as a fast-follow, we should create an optimized version of that structure. We can almost certainly get that down to 8 bytes, and we could manage 4 by creating a highly-compressed index of thing/property pairs -- literally just a Vector of the pairs with a 4-byte Int index. (But we would also need a matching reverse index, from the thing/property OID pairs to this Vector, so that we could update it.)
  • Follow-on: mandatory exact matches. If a search term is in double-quotes, then those exact words must appear, in that order. (Follow Google's syntax -- here's a good article.)
  • Later follow-on: more advanced capabilities like and/& or or/|. (See above-linked article.)
  • Much later follow-on: fuzzy word matching. This is a medium-priority goal for quite a ways down the line, but being able to match different parts of speech, common mis-spellings, and stuff like that would be lovely. (That said, we might well redo all of this on top of ElasticSearch before we get there.)