I'm currently struggling to get decent performance from a date range query on a core of ~18M documents with NRT indexing, in Solr 4.10 (CDH 5.14). I have tried multiple strategies, but each one fails.

Each document has multiple versions (10 to 100), each valid for a distinct, non-overlapping period of time (startTime/endTime).

The query pattern is the following: query on the referenceNumber (or other criteria) but only return documents valid at a referenceDate (day precision). 75% of queries select a referenceDate within the last 30 days. If we query without the referenceDate, performance is very good, but adding the referenceDate filter causes a 100x slowdown, even when forcing it to run as a post filter.

Here are some perf tests from a Python script that executes HTTP queries for 100 distinct referenceNumber values and records the QTime of each.

+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 1  | q=referenceNumber:{referenceNumber} | 100 calls in <10ms   | Performance OK           |
+----+-------------------------------------+----------------------+--------------------------+
| 2  | q=referenceNumber:{referenceNumber} | 99 calls in <10ms    | 1 call to warm up        |
|    | &fq=startDate:[* TO NOW/DAY]        | 1 call   in >=1000ms | the cache then all       |
|    | AND    endDate:[NOW/DAY TO *]       |                      | queries hit the filter   |
|    |                                     |                      | cache. Problem: as       |
|    |                                     |                      | soon as new documents    |
|    |                                     |                      | come in, they invalidate |
|    |                                     |                      | the cache.               |
+----+-------------------------------------+----------------------+--------------------------+
| 3  | q=referenceNumber:{referenceNumber} | 99 calls in >=500ms  | The average of           |
|    | &fq={!cache=false cost=200}         | 1  call  in >=1000ms | calls is 734.5ms.        |
|    | startDate:[* TO NOW/DAY]            |                      |                          |
|    | AND    endDate:[NOW/DAY TO *]       |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
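
For reference, a measurement like the one above can be sketched roughly as follows. This is a minimal sketch, not the original script: `fetch_qtime`, `summarize`, and the select URL passed in are assumed names, and it relies only on the `QTime` value that Solr reports in the `responseHeader` of a JSON response.

```python
import json
import urllib.parse
import urllib.request


def fetch_qtime(select_url, params):
    """Run one Solr query and return the QTime (ms) reported by Solr.

    select_url is an assumed value such as http://host:8983/solr/core/select.
    params is a dict; a list value (e.g. several fq) is sent as repeated keys.
    """
    query = urllib.parse.urlencode(dict(params, wt="json", rows=0), doseq=True)
    with urllib.request.urlopen(select_url + "?" + query) as response:
        return json.load(response)["responseHeader"]["QTime"]


def summarize(qtimes):
    """Bucket a list of QTime values the same way the tables above do."""
    return {
        "under_10ms": sum(1 for t in qtimes if t < 10),
        "at_least_1000ms": sum(1 for t in qtimes if t >= 1000),
        "mean_ms": sum(qtimes) / len(qtimes),
    }
```

Calling `fetch_qtime` once per referenceNumber and passing the collected values to `summarize` gives the call counts per bucket and the average.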

How is it possible that the additional date range filter query creates a 100x slowdown? Based on this blog post, I would have expected the date range query to perform about as well as the query without the additional filter: http://yonik.com/advanced-filter-caching-in-solr/

Or is the only option to change the softCommit/hardCommit delays, create 30 warmup fq for the past 30 days, and tolerate poor performance on 25% of our queries?
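
To make the warmup idea concrete, here is a minimal sketch of how the 30 filter strings could be generated, one per day with an explicit midnight UTC timestamp instead of NOW/DAY. The function name `warmup_filters` is hypothetical, and the sketch assumes the startDate/endDate field names from the queries above; the resulting strings would still have to be registered as warmup queries (for example through a newSearcher listener in solrconfig.xml).

```python
from datetime import date, timedelta


def warmup_filters(today, days=30):
    """Return one fq string per day, newest first, covering the last `days` days."""
    filters = []
    for offset in range(days):
        day = today - timedelta(days=offset)
        # Day precision: pin the reference date to midnight UTC.
        stamp = day.strftime("%Y-%m-%dT00:00:00Z")
        filters.append(
            "startDate:[* TO {0}] AND endDate:[{0} TO *]".format(stamp)
        )
    return filters
```

Queries would then need to send exactly the same fq string for a given referenceDate, since the filter cache is keyed on the filter query itself.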

Edit 1: Thanks for the answer. Unfortunately, using integers instead of tdate does not seem to provide any performance gain. It can only benefit from caching, like query ID 2 above, which means we would need a warmup strategy covering 30+ fq.

+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 4  | fq={!cache=false}                   | 35 calls in <10ms    |                          |
|    | referenceNumber:{referenceNumber}   | 65 calls in >10ms    |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 5  | fq={!cache=false}                   | 9 calls in >100ms    |                          |
|    | referenceNumber:{referenceNumber}   | 6 calls in >500ms    |                          |
|    | AND versionNumber:[2 TO *]          | 85 calls in >1000ms  |                          |
+----+-------------------------------------+----------------------+--------------------------+

Edit 2: It seems that moving my referenceNumber from fq to q and setting different costs improves the query time (not perfect, but better). What's weird, though, is that a cost >= 100 is supposed to make the filter run as a post filter, yet changing the cost from 20 to 200 does not seem to impact performance at all. Does anyone know how to check whether an fq param is executed as a post filter?

+----+-------------------------------------+----------------------+--------------------------+
| 6  | fq={!cache=false cost=0}            | 89 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 11 calls in >500ms   |                          |
|    | &fq={!cache=false cost=200}         |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 7  | fq={!cache=false cost=0}            | 36 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 64 calls in >500ms   |                          |
|    | &fq={!cache=false cost=20}          |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
Arthur Burkhardt
  • I am also having the same issue. Does anybody know an answer? – Deepak Janyavula Jun 21 '18 at 09:00
  • What is your precisionStep for the field? – MatsLindh Jun 21 '18 at 12:52
  • .. and is upgrading to a newer version of Solr an option? The DateRangeField was introduced later and uses the spatial features to provide proper range support. – MatsLindh Jun 21 '18 at 13:03
  • startDate/endDate have a precisionStep of 6, versionNumber has a precisionStep of 0. No, not possible to update solr because it's part of CDH 5.14. Solr 7 will be available in next version of CDH, due this year. – Arthur Burkhardt Jun 21 '18 at 14:19
  • I'm guessing the range of `versionNumber` is low enough that changing the precisionStep wouldn't do much anyway. A larger precisionStep could help for the dates to reduce the number of tokens generated slightly, but 6 or 8 are usually good values. `10` would be similar-ish to a resolution to the closest second for range searches. If you're only searching by day, an even larger precisionStep could be useful, but it's hard to say without testing with your data set and query profile. – MatsLindh Jun 22 '18 at 07:29

1 Answer

Hi, I have another solution for you; it should give good performance for the same query in Solr.

My suggestion is to store the dates in int format. Please see the example below.

 Your Start Date : 2017-03-01
 Your END Date : 2029-03-01

**Suggested int format:
 Start Date : 20170301
 END Date : 20290301**

When you fire the same query with int values instead of dates, it works faster, as expected.

 So your query will be:
q=referenceNumber:{referenceNumber}
&fq=startNewDate:[* TO YYYYMMDD]
AND    endNewDate:[YYYYMMDD TO *]
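
As a rough illustration, the conversion and the filter string could be built like this. It is only a sketch: `date_to_int` and `validity_filter` are hypothetical helper names, and startNewDate/endNewDate are the int fields assumed in the query above.

```python
from datetime import date


def date_to_int(day):
    """Encode a date as an int in YYYYMMDD form, e.g. 2017-03-01 -> 20170301."""
    return day.year * 10000 + day.month * 100 + day.day


def validity_filter(reference_day):
    """Build the fq that keeps documents valid on reference_day."""
    value = date_to_int(reference_day)
    return "startNewDate:[* TO {0}] AND endNewDate:[{0} TO *]".format(value)
```

For example, `validity_filter(date(2017, 3, 1))` produces a filter with 20170301 in both ranges.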

Hope this helps.

  • Thanks for the suggestion. I've updated the description to reflect your solution, but it does not seem to provide any performance gain compared to a date:[* TO NOW/DAY] solution. – Arthur Burkhardt Jun 21 '18 at 12:36
  • TrieDate's are indexed as longs internally, so this is the same as using a regular datefield. – MatsLindh Jun 21 '18 at 12:51