
The Lucene (4.6) highlighter has very slow performance when a frequent term is searched. Search is fast (100 ms), but highlighting may take more than an hour(!).

Details: a large text corpus was used (1.5 GB of plain text). Performance does not depend on whether the text is split into smaller pieces or not. (Tested with 500 MB and 5 MB pieces as well.) Positions and offsets are stored. If a very frequent term or pattern is searched, TopDocs are retrieved quickly (100 ms), but each searcher.doc(id) call is expensive (5-50 s), and getBestFragments() is extremely expensive (more than an hour), even though positions and offsets are stored and indexed for exactly this purpose. (Hardware: Core i7, 8 GB RAM.)

Broader background: this would serve language-analysis research. A special stemmer is used that also stores part-of-speech information. For example, if "adj adj adj adj noun" is searched, it returns all occurrences of that pattern in the text, with context.

Can I tune its performance, or should I choose another tool?

Used code:

            // indexing
            FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
            offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

            offsetsType.setStored(true);
            offsetsType.setIndexed(true);
            // term vectors with positions and offsets are what FastVectorHighlighter reads
            offsetsType.setStoreTermVectors(true);
            offsetsType.setStoreTermVectorOffsets(true);
            offsetsType.setStoreTermVectorPositions(true);
            offsetsType.setStoreTermVectorPayloads(true);

            doc.add(new Field("content", fileContent, offsetsType));


            // querying
            TopDocs results = searcher.search(query, limitStart + limit);

            int endPos = Math.min(results.scoreDocs.length, limitStart + limit);
            int startPos = Math.min(results.scoreDocs.length, limitStart);

            for (int i = startPos; i < endPos; i++) {
                int id = results.scoreDocs[i].doc;

                // bottleneck #1 (5-50 s):
                Document doc = searcher.doc(id);

                FastVectorHighlighter h = new FastVectorHighlighter();

                // bottleneck #2 (more than 1 hour):
                // (second argument m is the IndexReader, per the getBestFragments signature)
                String[] hs = h.getBestFragments(h.getFieldQuery(query), m, id, "content", contextSize, 10000);
            }
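Two tunings may help with both bottlenecks under the Lucene 4.x API. This is a sketch, not a drop-in fix: `searcher`, `reader`, `query`, `contextSize`, `startPos`/`endPos` follow the snippet above, and the small metadata field `"path"` is hypothetical. The idea is to avoid deserializing the multi-megabyte stored `"content"` field per hit via `IndexSearcher.doc(int, Set<String>)`, to build the highlighter and `FieldQuery` once outside the loop, and to request far fewer fragments than 10000:

```java
FastVectorHighlighter h = new FastVectorHighlighter();   // build once, not per hit
FieldQuery fieldQuery = h.getFieldQuery(query);          // build once, not per hit

for (int i = startPos; i < endPos; i++) {
    int id = results.scoreDocs[i].doc;

    // Bottleneck #1: load only the fields you need; skipping the huge
    // stored "content" field avoids deserializing megabytes per hit.
    Document doc = searcher.doc(id, Collections.singleton("path"));

    // Bottleneck #2: asking for 10000 fragments makes the highlighter build
    // and score every candidate fragment in the document; a small cap is
    // much cheaper when only a page of results is shown.
    String[] fragments = h.getBestFragments(fieldQuery, reader, id,
            "content", contextSize, 10);
}
```

Whether this gets the time down to something acceptable on a 1.5 GB corpus would need measuring; the fragment cap in particular should match how many snippets are actually displayed.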

Related (unanswered) question: https://stackoverflow.com/questions/19416804/very-slow-solr-performance-when-highlighting

steve
1 Answer


getBestFragments() relies on the tokenization done by the analyzer you're using. If you have to analyze such a big text, you'd better store the term vector WITH_POSITIONS_OFFSETS at indexing time.

Please read this and this book

By doing that, you won't need to analyze all the text at runtime: you can pick a method that reuses the existing term vectors, and this will reduce the highlighting time.
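In Lucene 4.x, the classic Highlighter can be pointed at the stored term vectors via TokenSources, which is presumably what "reuse the existing term vector" refers to. A sketch, assuming `reader`, `docId`, `query`, and `analyzer` are available:

```java
// Highlight without re-analyzing the full text: TokenSources reads the
// stored term vector (positions + offsets) and falls back to re-analysis
// only when no vector is available for the field.
TokenStream tokens = TokenSources.getAnyTokenStream(reader, docId, "content", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(query));
String storedText = reader.document(docId).get("content");
String[] fragments = highlighter.getBestFragments(tokens, storedText, 10);
```

Note that FastVectorHighlighter (used in the question) already reads term vectors directly, so this mainly matters if you switch to the classic Highlighter.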

AR1
  • Very interesting. I'm going to look into this. – Joshua Carmody Apr 10 '15 at 20:33
  • The code in question already stores positions and offsets. Should it be something else? @AR1 – Heidar Nov 22 '16 at 16:08
  • How can this answer be upvoted? The question clearly states that positions and offsets are stored, and all this answer offers is to read the documentation. This deserves a -1. I spent 20 minutes trying to remember my password to log in to comment on this. – Dmytro Aleksin Aug 03 '21 at 07:46