
I have a Solr 4.4.0 core that contains about 630k documents with an original size of about 10 GB. Each of the fields gets copied to the text field for querying and highlighting. When I execute a search without highlighting, the results come back in about 100 milliseconds, but when highlighting is turned on, the same query takes 10-11 seconds. Subsequent queries for the same terms also continue to take about 10-11 seconds.

My initial configuration of the field was as follows

<field name="text" type="text_general" indexed="true" stored="true"
   multiValued="true"
   omitNorms="true"
   termPositions="true"
   termVectors="true"
   termOffsets="true" />

The query that is sent is similar to the following

http://solrtest:8983/solr/Incidents/select?q=error+code&fl=id&wt=json&indent=true&hl=true&hl.useFastVectorHighlighter=true

All my research has provided no clue as to why the highlight performance is so bad. On a whim, I decided to see whether the omitNorms=true attribute could have an effect, so I removed it from the text field, wiped out the data, and reloaded from scratch.

<field name="text" type="text_general" indexed="true" stored="true"
   multiValued="true"
   termPositions="true"
   termVectors="true"
   termOffsets="true" />

Oddly enough, this seemed to fix things. The initial query with highlighting took 2-3 seconds with subsequent queries taking less than 100 milliseconds.

However, because we want the omitNorms=true in place, my permanent solution was to have two copies of the "text" field, one with the attribute and one without. The idea was to perform queries against one field and highlighting against the other. So now the schema looks like

<field name="text" type="text_general" indexed="true" stored="true"
   multiValued="true"
   omitNorms="true"
   termPositions="true"
   termVectors="true"
   termOffsets="true" />

<field name="text2" type="text_general" indexed="true" stored="true"
   multiValued="true"
   termPositions="true"
   termVectors="true"
   termOffsets="true" />

And the query is as follows

http://solrtest:8983/solr/Incidents/select?q=error+code&fl=id&wt=json&indent=true&hl=true&hl.fl=text2&hl.useFastVectorHighlighter=true
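
For scripted testing, the two request shapes (with and without highlighting) can be generated rather than typed by hand. This is only a minimal sketch: `build_select_url` is a hypothetical helper, and the host, core name, and parameters are simply the ones from the URL above.

```python
from urllib.parse import urlencode

# Host and core name are taken from the query above.
BASE_URL = "http://solrtest:8983/solr/Incidents/select"


def build_select_url(terms, highlight_field=None, base_url=BASE_URL):
    """Build a select URL; highlight_field=None means no highlighting."""
    params = [
        ("q", terms),
        ("fl", "id"),
        ("wt", "json"),
        ("indent", "true"),
    ]
    if highlight_field is not None:
        # Highlight against a specific field with the Fast Vector Highlighter.
        params += [
            ("hl", "true"),
            ("hl.fl", highlight_field),
            ("hl.useFastVectorHighlighter", "true"),
        ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_select_url("error code"))
    print(build_select_url("error code", highlight_field="text2"))
```

Calling it with `highlight_field="text2"` reproduces the request above, and omitting the argument gives the baseline query without highlighting.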

Again, the data was cleared and reloaded with the same 630k documents, but this time the index size is about 17 GB. (As expected, since the contents of the "text" field are duplicated.)

The problem is that the performance numbers are back to the original 10-11 seconds on each run. Either the improvement after first removing omitNorms was a fluke or something else is going on. I have no idea what...

Using jVisualVM to capture a CPU sample shows the following two methods using most of the CPU

org.apache.lucene.search.vectorhighlight.FieldPhraseList.<init>()    8202 ms (72.6%)
org.eclipse.jetty.util.BlockingArrayQueue.poll()                     1902 ms (16.8%)

I have seen the init method as low as 54% and the poll number as high as 30%.

Any ideas? Any other places I can look to track down the bottleneck?

Thanks

Update

I have done a bunch of testing with the same dataset but different configurations and here is what I have found...although I do not understand my findings.

  • Speedy highlighting performance requires that omitNorms not be set to true. (I have no idea what omitNorms and highlighting have to do with one another.)
  • However, this only seems to be true if both the query and highlighting are executed against the same field (i.e. df = hl.fl). (Again, no idea why...)
  • And another however: it only holds when done against the default text field that exists in the schema.

Here is how I tested -->

  • Test was against about 525,000 documents
  • Almost all of the fields were copied to the multi-valued text field
  • In some tests, almost all of the fields were also copied to a second multi-valued text2 field (this field was identical to text except that it had the opposite omitNorms setting)
  • Each time the configuration was changed, the Solr instance was stopped, the data folder was deleted, and the instance was started back up

What I found -->

  • When just the text field was used and omitNorms = true was present, performance was bad (10 second response time)
  • When just the text field was used and omitNorms = true was not present, performance was great (sub-second response times)
  • When text did not have omitNorms = true and text2 did, queries with highlighting against text returned in sub-second times; all other combinations resulted in 10-30 second response times.
  • When text did have omitNorms = true and text2 did not, all combinations of queries with highlighting returned in 7-10 seconds.
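
A matrix like this could be timed with a small script instead of by hand. The following is a rough sketch, not the procedure actually used for the numbers above: `field_combinations`, `build_url`, and `time_combinations` are hypothetical helpers, and `fetch` is any callable that takes a URL (injected so the timing logic can be exercised without a running Solr instance). Each combination is requested more than once so that first-run and repeat timings can be compared.

```python
import itertools
import time
from urllib.parse import urlencode

FIELDS = ("text", "text2")  # the two copies of the catch-all field


def field_combinations(fields=FIELDS):
    """Every (df, hl.fl) pairing of query field and highlight field."""
    return list(itertools.product(fields, repeat=2))


def build_url(base_url, terms, df, hl_fl):
    """Query against df and highlight against hl_fl."""
    params = [
        ("q", terms),
        ("df", df),
        ("fl", "id"),
        ("wt", "json"),
        ("hl", "true"),
        ("hl.fl", hl_fl),
        ("hl.useFastVectorHighlighter", "true"),
    ]
    return base_url + "?" + urlencode(params)


def time_combinations(fetch, base_url, terms, runs=2):
    """Return {(df, hl_fl): [seconds, ...]} with one entry per run."""
    results = {}
    for df, hl_fl in field_combinations():
        url = build_url(base_url, terms, df, hl_fl)
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            fetch(url)
            timings.append(time.perf_counter() - start)
        results[(df, hl_fl)] = timings
    return results
```

Against a live instance, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`. The omitNorms setting itself still has to be changed in the schema and the data reloaded between runs, as described in the test steps.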

I am soooo confused....

Jason
  • Not an answer, but... can you try the [PostingsHighlighter](https://cwiki.apache.org/confluence/display/solr/Postings+Highlighter)? Another point is that it requires less disk space: according to [Michael McCandless's blog post](http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html) the difference is "~7.8X for the 10 million document English Wikipedia index". But it does not support wildcard searches. – rchukh Oct 17 '13 at 17:05
  • Also, [here](http://osdir.com/ml/solr-user.lucene.apache.org/2013-05/msg01706.html) are some similar issues. – rchukh Oct 17 '13 at 17:12
  • @rchukh - I already saw that support posting you linked to. That is where I got the idea to use jvisualvm and found those two methods taking up the huge amount of CPU. Regarding the other highlighter, I read about that too but have held off implementing because there is no logical reason why the Fast Vector one is not-so-fast with my dataset. (But if I have to I will move in that direction.) – Jason Oct 17 '13 at 18:43
  • Just to clarify - how many fields do you have? Are we talking about 10-20 or 100 and more? – rchukh Oct 17 '13 at 19:07
  • Following the idea (many similar terms in the field) from the mailing list above, I was able to reproduce this issue... probably a little bit too much, because on my 524.07 MB index with 10204 docs it was **"QTime: 158465"** for a search request with a limit of 1 row... Strangely enough, the same request **_without_** hl.useFastVectorHighlighter returned in 174 ms. – rchukh Oct 17 '13 at 20:06
  • 35 fields total in the document. But as I stated, the query and highlight only goes against ONE field...the text field which gets its content copied from the other 34 fields. – Jason Oct 18 '13 at 15:13
  • I answered a similar question [here](http://stackoverflow.com/questions/21683752/very-slow-highlight-performance-in-lucene/26438933#26438933). – AR1 Oct 19 '14 at 01:49
  • What's the server that you run Solr from? – SaidbakR Apr 27 '15 at 11:39

1 Answer


I know that this is a bit dated, but I've run into the same issue and wanted to chime in with our approach.

We are indexing text from a bunch of binary docs and need Solr to maintain some metadata about each document as well as its text. Users need to search for docs based on metadata, run full-text searches within the content, and see highlights and snippets of relevant content. The performance problem gets worse the further into a document the content for the highlight/snippet is located (e.g. page 50 instead of page 2).

Due to the poor performance of highlighting, we had to break up each document into multiple Solr records. Depending on the length of the content field, we chop it up into smaller chunks, copy the metadata attributes to each record, and assign a per-document unique id to each record. Then at query time, we search the content field of all these records and group by that unique field we assigned. Since the content field is smaller, Solr does not have to go deep into each content field, and from an end user's standpoint this is completely transparent, although it does add a bit of indexing overhead for us.

Additionally, if you choose this approach, you may want to overlap the sections a little bit between each "sub document" to ensure that a phrase match at the boundary of two sections is still returned properly.
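
To illustrate the idea, here is a minimal sketch of the chunking step. It is not our actual indexing code: `split_into_chunks` and the field names `doc_id` and `content` are hypothetical, the sizes are arbitrary, and it splits on characters rather than tokens for simplicity.

```python
def split_into_chunks(doc_id, content, metadata, chunk_size=10000, overlap=200):
    """Split content into overlapping chunks, one Solr record per chunk.

    Every record carries a copy of the metadata plus the shared doc_id,
    so results can be grouped back into one hit per original document.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new chunk advances
    records = []
    start = 0
    index = 0
    while True:
        record = dict(metadata)              # copy metadata onto each record
        record["id"] = f"{doc_id}_{index}"   # unique key per record (assumed scheme)
        record["doc_id"] = doc_id            # shared id used for grouping
        record["content"] = content[start:start + chunk_size]
        records.append(record)
        if start + chunk_size >= len(content):
            break                            # this chunk reached the end
        start += step
        index += 1
    return records
```

At query time, the records could then be collapsed back into one result per document with Solr's result grouping, for example `group=true&group.field=doc_id`, assuming the shared id is indexed under that name.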

Hope it helps.

nick_v1