Solr Highlight matching query terms

Question

I am using Solr to do a fuzzy search (e.g., foo~2 bar~2). Highlighting lets me mark the matching document fragments in the result set.

For example:

Result 1: food bars
Result 2: mars bar

and so on.

For each match highlighted in the document, I need to figure out which query terms these fragments matched against, along with the offsets of those query terms in the query. Something like:

Result 1: {food MATCHED_AGAINST foo QUERY_OFFSET 0,2} {bars MATCHED_AGAINST bar QUERY_OFFSET 3,5}
Result 2: mars {bar MATCHED_AGAINST bar QUERY_OFFSET 3,5}

Is there a way to do this in Solr?

You could probably write custom code for it, but this is not supported out of the box. — Hector Correa, Dec 18 '18 at 21:36
True, I suppose this is easier in `Lucene`, but it looks like I will have to write a plugin. I was hoping things had changed over the last few versions :-) — Salil, Dec 19 '18 at 05:15
@Mysterion, can you please shed some light on what kind of customization I can do here? — Salil, Dec 19 '18 at 07:41

score 2 · Answer 1 · answered Dec 28 '18 at 13:32

One possibility is to customize the Highlighter so that it produces the needed information. The idea is simple: you have the method

org.apache.lucene.search.highlight.Highlighter#getBestTextFragments

In this method you have low-level access to the QueryScorer, which holds several useful fields such as

private Set<String> foundTerms;
private Map<String,WeightedSpanTerm> fieldWeightedSpanTerms;
private Query query;

I'm fairly sure that with this information you should be able to produce the needed output.
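
As a rough illustration of the post-processing such a customized highlighter would still need, here is a plain-Java sketch that does not call Lucene at all. It assumes you have already collected the highlighted terms by some means, and it swaps in plain Levenshtein distance as an approximation of Lucene's fuzzy matching, so results may differ (for example on transpositions). `QueryTermMatcher`, `parseQuery`, and `closest` are hypothetical names, and the offsets are inclusive positions in the raw query string, so they will not equal the illustrative numbers in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
class QueryTermMatcher {

    /** A query term with inclusive character offsets in the raw query string. */
    static final class QueryTerm {
        final String text;
        final int start;
        final int end;

        QueryTerm(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Term text, then an optional fuzzy suffix (~2) and an optional boost (^3.0).
    private static final Pattern TERM =
            Pattern.compile("([^\\s~^]+)(?:~\\d*)?(?:\\^[0-9.]+)?");

    /** Splits a simple whitespace-separated query into terms with offsets. */
    static List<QueryTerm> parseQuery(String query) {
        List<QueryTerm> terms = new ArrayList<>();
        Matcher m = TERM.matcher(query);
        while (m.find()) {
            terms.add(new QueryTerm(m.group(1), m.start(1), m.end(1) - 1));
        }
        return terms;
    }

    /** Plain Levenshtein distance (assumption: close enough to the fuzzy match). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the nearest query term within maxEdits, or null if none is close enough. */
    static QueryTerm closest(String matched, List<QueryTerm> queryTerms, int maxEdits) {
        QueryTerm best = null;
        int bestDistance = maxEdits + 1;
        for (QueryTerm q : queryTerms) {
            int d = editDistance(matched, q.text);
            if (d < bestDistance) {
                best = q;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<QueryTerm> queryTerms = parseQuery("foo~2 bar~2");
        // Highlighted terms from Result 1 in the question.
        for (String matched : List.of("food", "bars")) {
            QueryTerm q = closest(matched, queryTerms, 2);
            if (q != null) {
                System.out.println(matched + " MATCHED_AGAINST " + q.text
                        + " QUERY_OFFSET " + q.start + "," + q.end);
            }
        }
    }
}
```

This only handles flat queries of whitespace-separated terms; anything with phrases, fields, or grouping would need the real query parser.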

score 0 · Accepted Answer · answered Dec 30 '18 at 04:12

One hack I could figure out is to use a different (unique) boost factor for each term in the query, and then retrieve the boost factor for each matched term from the debug score output to deduce which query term that score came from.

For example, we can query with foo~2^3.0 bar~2^2.0 (boost scores from foo by 3.0 and scores from bar by 2.0). From the debug score output, check the boost factors:

Result 1: food bars: score <total score 1> = food * 3.0 * <other scoring terms> + bars * 2.0 * <other scoring terms>
Result 2: mars bar: score <total score 2> = bar * 2.0 * <other scoring terms>

From this it is clear that food matched with a boost factor of 3.0, and that both bars and bar matched with a boost factor of 2.0. By maintaining a lookup dictionary of which query term was given which boost, it is easy to figure out which terms matched.
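
A minimal sketch of that lookup dictionary in plain Java, under my own assumptions: `BoostedQueryBuilder` is a hypothetical helper, terms are simple words that need no escaping, and boosts are handed out as n+1, n, ..., 2.0 so that every term gets a distinct value and none gets 1.0.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
class BoostedQueryBuilder {

    /** Gives each term a distinct boost (n+1 down to 2.0) and records boost -> term. */
    static Map<Double, String> assignBoosts(List<String> terms) {
        Map<Double, String> termByBoost = new LinkedHashMap<>();
        int n = terms.size();
        for (int i = 0; i < n; i++) {
            termByBoost.put((double) (n + 1 - i), terms.get(i));
        }
        return termByBoost;
    }

    /** Builds a fuzzy query string such as "foo~2^3.0 bar~2^2.0". */
    static String buildQuery(Map<Double, String> termByBoost, int maxEdits) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Double, String> e : termByBoost.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getValue()).append('~').append(maxEdits)
              .append('^').append(e.getKey());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Double, String> termByBoost = assignBoosts(List.of("foo", "bar"));
        System.out.println(buildQuery(termByBoost, 2));
        // A boost of 2.0 seen in the debug output maps back to this query term.
        System.out.println(termByBoost.get(2.0));
    }
}
```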

Two factors to consider:

If the boost factor is 1.0, Solr's debug score output does not print it.
Solr might apply some default boost factor to the term based on fuzzy matching, TF-IDF, etc. In that case, the boost factor that shows up will not match the boosts we supplied in the query. For this reason, we need to execute our query twice: once without any boosting (to learn the default boost for every term), and once with boosting (to see how much it has changed).
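
The two-pass comparison can be sketched like this in plain Java. It assumes, and this is only an assumption, that the default per-term factor and the supplied boost multiply together, so dividing the factor from the boosted run by the factor from the unboosted run recovers the supplied boost. `BoostRatioResolver` is a hypothetical helper and the numeric factors are made up; extracting the factors from the debug output is left out because I am not assuming any particular output format.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Solr or SolrJ.
class BoostRatioResolver {

    /**
     * Recovers the query term whose supplied boost equals
     * boostedFactor / baselineFactor, within the given tolerance.
     * Pass 1.0 as the baseline when the unboosted run printed no factor.
     */
    static Optional<String> resolve(Map<Double, String> termByBoost,
                                    double baselineFactor,
                                    double boostedFactor,
                                    double tolerance) {
        if (baselineFactor <= 0.0) {
            return Optional.empty();
        }
        double ratio = boostedFactor / baselineFactor;
        for (Map.Entry<Double, String> e : termByBoost.entrySet()) {
            if (Math.abs(e.getKey() - ratio) <= tolerance) {
                return Optional.of(e.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Boosts supplied in the query foo~2^3.0 bar~2^2.0.
        Map<Double, String> termByBoost = Map.of(3.0, "foo", 2.0, "bar");

        // Made-up factors: "food" shows 0.75 unboosted and 2.25 boosted.
        System.out.println(resolve(termByBoost, 0.75, 2.25, 1e-6).orElse("unknown"));

        // Made-up factors: exact match "bar" shows nothing unboosted (1.0) and 2.0 boosted.
        System.out.println(resolve(termByBoost, 1.0, 2.0, 1e-6).orElse("unknown"));
    }
}
```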

Hope this helps someone.

Unfortunately, debug is something that isn't recommended for constant use on a production system — Mysterion, Dec 30 '18 at 10:03
Well, this is only a `hack` which can solve the issue without writing any custom code or plugin :-) — Salil, Dec 31 '18 at 11:11
