
I am customizing the Highlighter plugin (using FVH, the FastVectorHighlighter) to output the position offsets of the query terms for a given search. So far I have been able to extract the offset information for normal queries using the code below. For phrase queries, however, the code returns the position offsets of every query term (i.e. everything in termSet), including occurrences that are not part of the matched phrase. Is there a way in Lucene to get the offset information of only the matched phrase for phrase queries using FVH?

// In DefaultSolrHighlighter.java::doHighlightingByFastVectorHighlighter()

SolrIndexSearcher searcher = req.getSearcher();
TermFreqVector[] tvector = searcher.getReader().getTermFreqVectors(docId);
// Takes the first term vector of the document and treats it as a position vector
TermPositionVector tvposition = (TermPositionVector) tvector[0];

Set<String> termSet = highlighter.getHitTermSet(fieldQuery, fieldName);

List<String> hitOffsetPositions = new ArrayList<String>();

for (String term : termSet)
{
    int index = tvposition.indexOf(term);
    if (index < 0)
        continue; // term does not occur in this document

    int[] positions = tvposition.getTermPositions(index);
    if (positions == null || positions.length == 0)
        continue; // no position data stored for this term

    // Join the positions of this term into a comma-separated string
    StringBuilder sb = new StringBuilder();
    for (int pos : positions)
    {
        if (sb.length() > 0)
            sb.append(',');
        sb.append(pos);
    }
    hitOffsetPositions.add(sb.toString());
}

 if( snippets != null && snippets.length > 0 )
{
  docSummaries.add( fieldName, snippets );
  docSummaries.add( "hitOffsetPositions", hitOffsetPositions);
}


// In FastVectorHighlighter.java
// Wrapper function to get query Terms
   public Set<String> getHitTermSet (FieldQuery fieldQuery, String fieldName)
  {
      Set<String> termSet = fieldQuery.getTermSet( fieldName );
      return termSet;
  }

Current Output:

<lst name="6H500F0">
  <arr name="name">
  <str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
</arr>
<arr name="hitOffsetPositions">
    <str>2</str>
    <str>3</str>
    <str>10</str>
</arr>

Expected Output:

<lst name="6H500F0">
  <arr name="name">
  <str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
</arr>
<arr name="hitOffsetPositions">
    <str>2</str>
    <str>3</str>
</arr>

The field that I am trying to highlight has termVectors="true", termPositions="true" and termOffsets="true", and I am using Lucene 3.1.0.

Kevin Reid
Jahangir
1 Answer


I wasn't able to get the FVH to handle phrase queries correctly, and wound up having to develop my own summarizer. The gist of my approach is discussed here; what I wound up doing is creating an array of objects, one for each occurrence of a term that I pulled from the queries. Each object contains a word index, its position, and a flag saying whether it was already used in some match. These are the TermAtPosition instances in the sample below. Then, given a position span and an array of word identities (indexes) corresponding to a phrase query, I iterate through the array, looking to match all term indexes within the given span. If I find a match, I mark each matching term as consumed and add the matching span to a list of matches. I can then use these matches to score sentences. Here is the matching code:

protected void scorePassage(TermPositionVector v, String[] words, int span, 
                    float score, SentenceScore[] scores, Scorer scorer) {
    TermAtPosition[] order = getTermsInOrder(v, words);
    if (order.length < words.length)
        return;
    int positions[] = new int[words.length];
    List<int[]> matches = new ArrayList<int[]>();
    for(int t=0; t<order.length; t++) {
        TermAtPosition tap = order[t];
        if (tap.consumed)
            continue;

        int p = 0;
        positions[p++] = tap.position;
        for(int u=0; u<words.length; u++) {
            if (u == tap.termIndex)
                continue;
            int nextTermPos = spanContains(order, u, tap.position, span);
            if (nextTermPos == -1)
                break;
            positions[p++] = nextTermPos;
        }
        // got all terms
        if (p == words.length)
            matches.add(recordMatch(order, positions.clone()));
    }
    if (matches.size() > 0) {
        for (SentenceScore sentenceScore: scores) {
            for(int[] matchingPositions: matches)
                scorer.scorePassage(sentenceScore, matchingPositions, score);
        }
    }
}


protected int spanContains(TermAtPosition[] order, int targetWord, 
                  int start, int span) {
    for (int i=0; i<order.length; i++) {
        TermAtPosition tap = order[i];
        if (tap.consumed || tap.position <= start || 
                       (tap.position > start + span))
            continue;
        if (tap.termIndex == targetWord)
            return tap.position;
    }
    return -1;
}

This approach seems to work, but it is greedy. Given a sequence "a a b c" it will match the first a (leaving the second a alone), and then match b and c. I think a bit of recursion or integer programming could be applied to make it less greedy, but I couldn't be bothered, and wanted a faster rather than a more accurate algorithm anyway.
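For the exact-phrase case in the question (no slop), a simpler check may be enough once you have the position array of each phrase term, for example from getTermPositions(). The sketch below is not part of my summarizer and does not touch any Lucene API; the class and method names (PhrasePositions, phraseStarts) are made up for illustration, and it assumes each position array is sorted in ascending order and that the terms are given in phrase order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Lucene class.
public class PhrasePositions {

    /**
     * positionsPerTerm[i] holds the sorted positions of the i-th phrase term.
     * Returns the start position of every place where the terms occur at
     * consecutive positions, i.e. an exact phrase match.
     */
    public static List<Integer> phraseStarts(int[][] positionsPerTerm) {
        List<Integer> starts = new ArrayList<Integer>();
        if (positionsPerTerm.length == 0)
            return starts;

        for (int start : positionsPerTerm[0]) {
            boolean allFound = true;
            for (int i = 1; i < positionsPerTerm.length; i++) {
                // The i-th term must sit exactly i positions after the first term
                if (Arrays.binarySearch(positionsPerTerm[i], start + i) < 0) {
                    allFound = false;
                    break;
                }
            }
            if (allFound)
                starts.add(start);
        }
        return starts;
    }
}
```

With the positions from the question ("hard" at 2, "drive" at 3 and 10), only the start position 2 survives, so the phrase covers positions 2 and 3 and the stray "drive" at 10 is dropped.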

Gene Golovchinsky
  • You have to know which terms are variants (implicitly ORed) and which are required for a match. I would process the required terms as above; to process variants (only one of which has to match), change the logic around the spanContains() call so that it is called once for each variant and the return value closest to the required term is kept. – Gene Golovchinsky Jun 01 '11 at 01:04
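
A rough sketch of the variant handling described in the comment above, written against plain position arrays rather than the TermAtPosition objects from the answer. The names (VariantMatch, closestVariant) are hypothetical, and the window test mirrors the one in spanContains(): a position counts only if it is after the anchor and no more than span positions away.

```java
// Hypothetical helper, not part of the answer's original code.
public class VariantMatch {

    /**
     * variantPositions[v] holds the positions of the v-th variant term.
     * Returns the position closest to (and after) anchor that lies within
     * span positions of it, across all variants, or -1 if none qualifies.
     */
    public static int closestVariant(int[][] variantPositions, int anchor, int span) {
        int best = -1;
        for (int[] positions : variantPositions) {
            for (int pos : positions) {
                if (pos <= anchor || pos > anchor + span)
                    continue; // outside the window
                if (best == -1 || pos < best)
                    best = pos; // smaller position means closer to the anchor
            }
        }
        return best;
    }
}
```

This leaves out the consumed flag; in the real matcher the winning occurrence would still need to be marked as consumed.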