I work with large numbers of medium-small (~2 MB) data files and am trying to determine the fastest way to look up values based on timestamps.
This would be simple if I were looking up "Find data for timestamp X," but I generally want "find the most recent data whose timestamp is before-or-on date X."
Here are specifics: Imagine you have a cluster of 300 houses, each of which occasionally gets mail. You are monitoring the type of mail they get. Say there are 15 categories of mail you care about.
The question of interest is "What was the most recent category of mail delivered to the house on-or-before date D?"
A. The data files being referenced have the form:
<data>
<house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
<house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
<house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
<house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
...
</data>
B. The data file is not necessarily sorted. If this makes a difference in best practices, please indicate in your answer.
C. Even though ~300 houses are traced in the data file, I only need data from 60 specific ones for my work.
D. Information exists for 100 dates, and most houses get mail on 3-20 of those 100 dates.
E. Mail can be delivered throughout the day. So on a given day, a person could first get category 1, then later get category 2, and then finally get category 8 in the evening.
F. For a typical data document, a given house's information will likely be requested about 10 times.
Here are two possible paths, and my thoughts on each one. I'm hoping one of the XSLT3-super programmers will have a better option.
Solution 1: Large Map
Maps are generally the preferred solution to many XSLT3 speed questions, but I'm not sure how amenable they are to this question because it seems like you have to create a huge map, most of which you never actually need.
What I have tried is sketched below:
<xsl:variable name="sorted_data" select="saxon:sort(houses I want from data, by date)"/>
<xsl:variable name="dates" select="distinct-values($sorted_data/date:date(@timestamp))"/>
<xsl:variable name="mail.map.pieces" as="map(*)*">
<xsl:for-each-group select="$sorted_data" group-by="$house_number">
<xsl:iterate select="current-group()">
Use iteration to form one map for every possible date/house, reading data file once.
map has form map{concat($date, '--', $house_number) := last_mail_type}
Note that this internal piece requires a bit of extra computation because you need a map for _every_ date in $dates, but the set being iterated over only contains nodes for dates on which the house received mail.
</xsl:iterate>
</xsl:for-each-group>
</xsl:variable>
<xsl:variable name="mail.map" select="map:new($mail.map.pieces)"/>
The issue is that constructing this map requires 60 * 100 map{} commands, only 10% of which will be used. There would also be several calls to deal with the missing-days problem.
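For concreteness, the large-map idea could be written in standard XSLT 3.0 roughly as below. This is only a sketch under assumptions, not a tuned implementation: it assumes @timestamp holds an xs:dateTime value (the real format is not shown above), the $wanted parameter and its IDs are hypothetical, and it uses a plain nested loop rather than xsl:iterate. House/date pairs that fall before a house's first mail get no entry.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all">

  <!-- Hypothetical parameter: the house IDs of interest. -->
  <xsl:param name="wanted" as="xs:string*" select="('H001', 'H002')"/>

  <!-- Assumption: @timestamp is an xs:dateTime such as 2015-03-01T09:30:00. -->
  <xsl:variable name="sorted_data" as="element(house)*"
      select="sort(/data/house[@house_ID = $wanted], (),
                   function($h) { xs:dateTime($h/@timestamp) })"/>

  <!-- Every calendar date on which any wanted house received mail. -->
  <xsl:variable name="dates" as="xs:date*"
      select="distinct-values($sorted_data/xs:date(xs:dateTime(@timestamp)))"/>

  <!-- One entry per house/date pair: 'house- -date' style key to last category. -->
  <xsl:variable name="mail.map" as="map(xs:string, xs:string)">
    <xsl:map>
      <xsl:for-each-group select="$sorted_data" group-by="string(@house_ID)">
        <xsl:variable name="house" select="current-grouping-key()"/>
        <xsl:variable name="mail" select="current-group()"/>
        <xsl:for-each select="$dates">
          <xsl:variable name="d" select="."/>
          <!-- Last delivery for this house on or before date $d. -->
          <xsl:variable name="hit"
              select="$mail[xs:date(xs:dateTime(@timestamp)) le $d][last()]"/>
          <xsl:if test="exists($hit)">
            <xsl:map-entry key="$house || '--' || $d"
                select="string($hit/@mail_category)"/>
          </xsl:if>
        </xsl:for-each>
      </xsl:for-each-group>
    </xsl:map>
  </xsl:variable>

</xsl:stylesheet>
```

A lookup would then be something like $mail.map('H001--2015-03-01'), which returns the empty sequence when no entry exists. Note that this only answers the question for dates that appear in $dates.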
Solution 2: Small Maps
Another option using maps is to associate all mail data for a given house to that house_ID, and then take the search/filter hit later:
<xsl:variable name="sorted_data" select="saxon:sort(houses I want from data, by date)"/>
<xsl:variable name="dates" select="distinct-values($sorted_data/date:date(@timestamp))"/>
<xsl:variable name="mail.map.pieces" as="map(*)*">
<xsl:for-each-group select="$sorted_data" group-by="$house_number">
<xsl:sequence select="map{$house_number := current-group()}"/>
</xsl:for-each-group>
</xsl:variable>
<xsl:variable name="mail.map" select="map:new($mail.map.pieces)"/>
Then later, to answer the question for a given date, you will need to select among the small number of nodes associated with that house:
Most recent mail category prior to date d for house x = map:get($mail.map, x)[current()/date le d][last()]/@mail_category
This obviously takes less work to create the maps, but retrieving the data requires more work each time because of the extra filtering. There is also the issue that the "Large map" solution would allow me to connect a house/date directly to the value I want [the mail type], while this method requires me to connect the key value to the node, so there will be the added cost of reading out the mail category information from that node.
One final advantage this has over solution 1 is that it easily covers the alternate question of "what is the most recent mail type on or before time T" [so instead of based on date it is based on the actual timestamp.]
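As a rough illustration, the small-map approach might look like this in standard XSLT 3.0. It is a sketch under assumptions: @timestamp is taken to be an xs:dateTime value, and the $wanted parameter, the f: namespace and the function name f:latest-category are all made up for the example.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:mail"
    exclude-result-prefixes="#all">

  <!-- Hypothetical parameter: the house IDs of interest. -->
  <xsl:param name="wanted" as="xs:string*" select="('H001', 'H002')"/>

  <!-- house_ID -> that house's mail, sorted by timestamp (one entry per house).
       Assumption: @timestamp is an xs:dateTime such as 2015-03-01T09:30:00. -->
  <xsl:variable name="mail.map" as="map(xs:string, element(house)*)">
    <xsl:map>
      <xsl:for-each-group select="/data/house[@house_ID = $wanted]"
          group-by="string(@house_ID)">
        <xsl:map-entry key="current-grouping-key()"
            select="sort(current-group(), (),
                         function($h) { xs:dateTime($h/@timestamp) })"/>
      </xsl:for-each-group>
    </xsl:map>
  </xsl:variable>

  <!-- Most recent mail category for $house on or before $day, if any. -->
  <xsl:function name="f:latest-category" as="xs:string?">
    <xsl:param name="house" as="xs:string"/>
    <xsl:param name="day" as="xs:date"/>
    <xsl:sequence
        select="$mail.map($house)[xs:date(xs:dateTime(@timestamp)) le $day][last()]
                /@mail_category/string()"/>
  </xsl:function>

</xsl:stylesheet>
```

Because the sort happens once per house inside the map, the unsorted input from point B is handled without sorting the whole file, and the per-lookup filter only scans the 3-20 nodes belonging to that house. Swapping the xs:date comparison for a direct xs:dateTime comparison would give the timestamp-based variant.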
Solution 3: Keys
Another option is to use keys, keying all the mail for a given house to its house_ID. In theory this should work very similarly to the "Small Map" option. You use a key to retrieve just the mail for the house you want, and then you filter to select the most recent mail on or before the date needed.
However, there are differences in the construction part. The maps required a for-each-group operation and then one map operation for each house. I expect that constructing a key takes less time.
On the other hand, a key only works on document nodes. If the original document is not sorted, then I would need to sort the document and create a new document in memory to work on. I cannot simply build a key on the sorted sequence of nodes. I don't know the relative cost of creating this document in memory, but I imagine it is more than the time required to construct the map in solution 2.
If the original document is already sorted, then the key may be faster?
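For comparison, a key-based lookup could be sketched as below. It is only one possible shape, under assumptions: @timestamp is taken to be an xs:dateTime value, and the key name mail-by-house, the f: namespace and the function name are invented for the example. The sketch keys the original, unsorted document and sorts only the few nodes returned for one house, instead of building a sorted copy of the document.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:mail"
    exclude-result-prefixes="#all">

  <!-- Index every house element by its house_ID. -->
  <xsl:key name="mail-by-house" match="house" use="@house_ID"/>

  <!-- Most recent mail category for $house on or before $day, if any.
       Assumption: @timestamp is an xs:dateTime such as 2015-03-01T09:30:00. -->
  <xsl:function name="f:latest-category" as="xs:string?">
    <xsl:param name="doc" as="document-node()"/>
    <xsl:param name="house" as="xs:string"/>
    <xsl:param name="day" as="xs:date"/>
    <!-- key() returns nodes in document order, which may not be time order. -->
    <xsl:variable name="candidates"
        select="key('mail-by-house', $house, $doc)
                [xs:date(xs:dateTime(@timestamp)) le $day]"/>
    <xsl:sequence
        select="sort($candidates, (),
                     function($h) { xs:dateTime($h/@timestamp) })[last()]
                /@mail_category/string()"/>
  </xsl:function>

</xsl:stylesheet>
```

The trade-off is that the sort is repeated on every lookup (about 10 times per house here) rather than done once, so whether this beats the small-map version would need to be measured.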