I work with large numbers of small-to-medium (~2 MB) data files and am trying to determine the fastest way to look up values based on timestamps.

This would be simple if I were looking up "Find data for timestamp X," but I generally want "find the most recent data whose timestamp is before-or-on date X."

Here are specifics: Imagine you have a cluster of 300 houses, each of which occasionally gets mail. You are monitoring the type of mail they get. Say there are 15 categories of mail you care about.

The question of interest is "What was the most recent category of mail delivered to the house on-or-before date D?"

A. The data files being referenced have the form:

<data>
 <house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
 <house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
 <house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
 <house house_ID = "XXX" mail_category="YYY" timestamp="ZZZ"/>
 ...
</data>

B. The data file is not necessarily sorted. If this makes a difference in best practices, please indicate in your answer.

C. Even though ~300 houses are traced in the data file, I only need data from 60 specific ones for my work.

D. Information exists for 100 dates, and most houses get mail on 3-20 of those 100 dates.

E. Mail can be delivered throughout the day. So on a given day, a person could first get category 1, then later get category 2, and then finally get category 8 in the evening.

F. For a typical data document, a given house's information will likely be requested about 10 times.

Here are two possible paths, and my thoughts on each one. I'm hoping one of the XSLT3-super programmers will have a better option.

Solution 1: Large Map

Maps are generally the preferred solution to many XSLT3 speed questions, but I'm not sure how amenable they are to this question because it seems like you have to create a huge map, most of which you never actually need.

What I have tried is sketched below:

<xsl:variable name="sorted_data" select="saxon:sort(houses I want from data, by date)"/>
<xsl:variable name="dates" select="distinct-values($sorted_data/date:date(@timestamp))"/>
<xsl:variable name="mail.map.pieces" as="map(*)*">
 <xsl:for-each-group select="$sorted_data" group-by="$house_number">
   <xsl:iterate select="current-group()">
      Use iteration to form one map for every possible date/house, reading the data file once.
      Each map has the form  map{concat($date, '--', $house_number) := last_mail_type}
      Note that this internal piece requires a bit of extra computation because you need a map for _every_ date in $dates, but the set being iterated over only contains nodes for dates on which the house received mail.
   </xsl:iterate>
  </xsl:for-each-group>
</xsl:variable>

<xsl:variable name="mail.map" select="map:new($mail.map.pieces)"/>

The issue is that constructing this map requires 60 * 100 map{} commands, only 10% of which will be used. There would also be several calls to deal with the missing-days problem.

Solution 2: Small Maps

Another option using maps is to associate all mail data for a given house to that house_ID, and then take the search/filter hit later:

<xsl:variable name="sorted_data" select="saxon:sort(houses I want from data, by date)"/>
<xsl:variable name="dates" select="distinct-values($sorted_data/date:date(@timestamp))"/>
<xsl:variable name="mail.map.pieces" as="map(*)*">
 <xsl:for-each-group select="$sorted_data" group-by="$house_number">
  <xsl:sequence select="map{$house_number := current-group()}"/>
 </xsl:for-each-group>
</xsl:variable>
<xsl:variable name="mail.map" select="map:new($mail.map.pieces)"/>

Then later, to answer the question for a given date, you will need to select among the small number of data associated with that house:

Most recent mail category on or before date d for house x = map:get($mail.map, x)[date:date(@timestamp) le d][last()]/@mail_category

This obviously takes less work to create the maps, but retrieving the data requires more work each time because of the extra filtering. There is also the issue that the "Large map" solution would allow me to connect a house/date directly to the value I want [the mail type], while this method requires me to connect the key value to the node, so there will be the added cost of reading out the mail category information from that node.

One final advantage this has over solution 1 is that it easily covers the alternate question of "what is the most recent mail type on or before time T" [so instead of based on date it is based on the actual timestamp.]
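For reference, here is a minimal sketch of the "Small Maps" idea in standard XSLT 3.0, using xsl:map and xsl:map-entry rather than the draft map{... := ...} and map:new syntax above. Several things in it are assumptions and not part of my actual data: the namespace urn:example:mail, the function name f:latest-category, the $wanted parameter and its sample house IDs, and timestamps that are valid xs:dateTime values.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:mail"
    exclude-result-prefixes="#all">

  <!-- Assumption: the house IDs of interest arrive as a parameter (sample values). -->
  <xsl:param name="wanted" as="xs:string*" select="('12', '47')"/>

  <!-- One entry per wanted house: house_ID -> that house's deliveries, oldest first.
       Sorting inside each group means the source file itself need not be sorted. -->
  <xsl:variable name="mail.map" as="map(xs:string, element(house)*)">
    <xsl:map>
      <xsl:for-each-group select="/data/house[@house_ID = $wanted]"
                          group-by="string(@house_ID)">
        <xsl:map-entry key="current-grouping-key()">
          <xsl:perform-sort select="current-group()">
            <xsl:sort select="xs:dateTime(@timestamp)"/>
          </xsl:perform-sort>
        </xsl:map-entry>
      </xsl:for-each-group>
    </xsl:map>
  </xsl:variable>

  <!-- Category of the latest delivery to $house at or before $at; empty if none. -->
  <xsl:function name="f:latest-category" as="xs:string?">
    <xsl:param name="house" as="xs:string"/>
    <xsl:param name="at" as="xs:dateTime"/>
    <xsl:sequence select="$mail.map($house)[xs:dateTime(@timestamp) le $at][last()]
                          /@mail_category/string()"/>
  </xsl:function>

  <!-- Example call with a hypothetical house ID and cutoff. -->
  <xsl:template match="/">
    <result>
      <xsl:value-of select="f:latest-category('12', xs:dateTime('2013-09-16T23:59:59'))"/>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

For the "on or before date D" form of the question, the predicate could instead compare xs:date(xs:dateTime(@timestamp)) against an xs:date. With only 3-20 deliveries per house, the filter scans a very short sequence on each lookup.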

Solution 3: Keys

Another option is to use keys, keying all the mail for a given house to its house_ID. In theory this should work very similarly to the "Small Map" option. You use a key to retrieve just the mail for the house you want, and then you filter to select the most recent mail on or before the date needed.

However, there are differences in the construction part. The maps required a for-each-group operation and then one map operation for each house. I expect that constructing a key takes less time.

On the other hand, a key only works on document nodes. If the original document is not sorted, then I would need to sort the document and create a new document in memory to work on. I cannot simply build a key on the sorted sequence of nodes. I don't know the relative cost of creating this document in memory, but I imagine it is more than the time required to construct the map in solution 2.

If the original document is already sorted, then the key may be faster?
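A rough sketch of the key-based variant follows. It may avoid the re-sorting problem entirely, because instead of relying on document order it picks the maximum timestamp among the candidates. The key name mail-by-house, the function name f:latest-by-key, and the namespace urn:example:mail are made up for illustration, and timestamps are again assumed to be valid xs:dateTime values.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:mail"
    exclude-result-prefixes="#all">

  <!-- Index every house element by its house_ID, built on the original document. -->
  <xsl:key name="mail-by-house" match="house" use="@house_ID"/>

  <!-- Category of the latest delivery to $house at or before $at; empty if none.
       Works on an unsorted document because it selects by maximum timestamp. -->
  <xsl:function name="f:latest-by-key" as="xs:string?">
    <xsl:param name="doc" as="document-node()"/>
    <xsl:param name="house" as="xs:string"/>
    <xsl:param name="at" as="xs:dateTime"/>
    <!-- All deliveries to this house that are not later than the cutoff. -->
    <xsl:variable name="candidates" as="element(house)*"
        select="key('mail-by-house', $house, $doc)[xs:dateTime(@timestamp) le $at]"/>
    <!-- The most recent timestamp among them (empty if there are no candidates). -->
    <xsl:variable name="latest" as="xs:dateTime?"
        select="max($candidates/xs:dateTime(@timestamp))"/>
    <!-- If several deliveries share that timestamp, take the last in document order. -->
    <xsl:sequence select="$candidates[xs:dateTime(@timestamp) eq $latest][last()]
                          /@mail_category/string()"/>
  </xsl:function>

</xsl:stylesheet>
```

This passes over the house's deliveries twice per lookup (once for max, once to select), which should be cheap for 3-20 entries per house, but I have not measured it against the map version.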

David R

1 Answer

Sorry, you've given very careful thought and time to formulating this question, and I would really want to give equal care and attention to answering it, but I don't have the time.

Of course the problem with both maps and keys is that they do equality matching only. I don't know whether you're interested in using extensions, but it looks like a good case for the "range keys" introduced in Saxon 9.5: see http://www.saxonica.com/documentation/index.html#!functions/saxon/key-map

There are two main ideas here: firstly, it allows a key to be used as a map, so you can iterate over all the key values. Secondly, it provides guaranteed ordering of the map entries, so you can do the traversal in key order.

This should allow you, with a little ingenuity, to build for example a map that indexes all the postal deliveries for a particular week, and then to scan these in date order. I would think this could give quite an efficient solution to your problem.

Michael Kay
  • Is it allowed to create a map whose entries are key-maps? The issue is that I don't think I would be able to break things up into weekly chunks since a given house may not get mail for weeks and weeks. However, I could break things up in terms of houses. So if my key value were something like HHH---DDD, where "HHH" is house number and "DDD" is date, then (to avoid having to scan all entries every time), I could essentially make one range-key for each house and then use a map to call the range-key I'm interested in. – David R Sep 16 '13 at 22:53
  • In other words, to avoid having to say map:keys($map)[. gt 30-0000-00-00][. le concat($house,'---',$date)][last()], I could use this outer map to call the range-map associated with house $house, cutting down tremendously on the number of key entries that have to be scanned. – David R Sep 16 '13 at 23:00
  • So if $big.map is the map whose entries are the individual range-key-equipped maps (with keys = house number), then I would be looking at: map:keys(map:get($big.map, $house.number))[. lt $date-as-string][last()] (Sorry, I realize I'm being sloppy with terminology... When I said "range-key" or "range-map," I really meant the map constructed USING the saxon:key-map() function built on a range-key.) – David R Sep 16 '13 at 23:02
  • This all looks feasible in theory. Feel free to try it, and if you hit trouble, talk to us at saxonica.plan.io to move forward. – Michael Kay Sep 17 '13 at 07:55