
I am looking for a way to find which terms matched in each document when running a wildcard search in Lucene. I tried using the explainer to find the terms, but this failed. The relevant portion of the code is below.

ScoreDoc[] myHits = myTopDocs.scoreDocs;
int hitsCount = myHits.Length;
for (int myCounter = 0; myCounter < hitsCount; myCounter++)
{
    Document doc = searcher.Doc(myHits[myCounter].doc);
    Explanation explanation = searcher.Explain(myQuery, myCounter);
    string myExplanation = explanation.ToString();
    ...

When I search for, say, micro*, documents are found and the code enters the loop, but myExplanation contains NON-MATCH and no other information.

How do I get the term that was found in each document?

Any help would be most appreciated.

Regards

Puneet

2 Answers

    class TVM : TermVectorMapper
    {
        public List<string> FoundTerms = new List<string>();
        HashSet<string> _termTexts = new HashSet<string>();

        public TVM(Query q, IndexReader r) : base()
        {
            List<Term> allTerms = new List<Term>();
            q.Rewrite(r).ExtractTerms(allTerms);
            foreach (Term t in allTerms) _termTexts.Add(t.Text());
        }

        public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
        {
        }

        public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
        {
            if (_termTexts.Contains(term)) FoundTerms.Add(term);
        }
    }

    void TermVectorMapperTest()
    {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new Lucene.Net.Analysis.Standard.StandardAnalyzer(), true);
        Document d = null;

        d = new Document();
        d.Add(new Field("text", "microscope aaa", Field.Store.YES, Field.Index.ANALYZED,Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.AddDocument(d);

        d = new Document();
        d.Add(new Field("text", "microsoft bbb", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.AddDocument(d);

        writer.Close();

        IndexReader reader = IndexReader.Open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        QueryParser queryParser = new QueryParser("text", new Lucene.Net.Analysis.Standard.StandardAnalyzer());
        queryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); 
        Query query = queryParser.Parse("micro*");

        TopDocs results = searcher.Search(query, 5);
        System.Diagnostics.Debug.Assert(results.TotalHits == 2);

        TVM tvm = new TVM(query, reader);
        for (int i = 0; i < results.ScoreDocs.Length; i++)
        {
            Console.Write("DOCID:" + results.ScoreDocs[i].Doc + " > ");
            reader.GetTermFreqVector(results.ScoreDocs[i].Doc, "text", tvm);
            foreach (string term in tvm.FoundTerms) Console.Write(term + " ");
            tvm.FoundTerms.Clear();
            Console.WriteLine();
        }
    }
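
The code above works in two steps: it rewrites the wildcard query into the concrete terms it expands to, then checks each hit's term vector against that set. As a rough illustration of the same idea outside Lucene, here is a minimal plain-Java sketch. It makes no Lucene calls; the class `WildcardTermMatch`, the methods `expandPrefix` and `matchedTerms`, and the in-memory "index" are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for the rewrite + term-vector approach; not a Lucene API.
public class WildcardTermMatch {

    // Step 1 (plays the role of query rewriting): collect every term in the
    // sorted term dictionary that starts with the given prefix.
    static Set<String> expandPrefix(SortedSet<String> dictionary, String prefix) {
        Set<String> expanded = new TreeSet<>();
        // Terms sharing a prefix are contiguous in sorted order, starting at the prefix.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            expanded.add(term);
        }
        return expanded;
    }

    // Step 2 (plays the role of the term vector mapper): keep only the terms of
    // one document that belong to the expanded set.
    static List<String> matchedTerms(List<String> docTerms, Set<String> expanded) {
        List<String> found = new ArrayList<>();
        for (String term : docTerms) {
            if (expanded.contains(term)) {
                found.add(term);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Two tiny "documents", already analyzed into terms.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("microscope", "aaa"),
                Arrays.asList("microsoft", "bbb"));

        // The term dictionary is the union of all document terms.
        SortedSet<String> dictionary = new TreeSet<>();
        for (List<String> doc : docs) {
            dictionary.addAll(doc);
        }

        Set<String> expanded = expandPrefix(dictionary, "micro");
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("DOCID:" + i + " > " + matchedTerms(docs.get(i), expanded));
        }
    }
}
```

Run as is, this prints `DOCID:0 > [microscope]` and `DOCID:1 > [microsoft]`, so each document reports only its own matching term.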
L.B
  • I had to modify the TVM class to use HashTable for C#. Thanks, it worked as I wanted it to. – Puneet Sep 24 '11 at 08:53
  • You don't have to modify it with Lucene.Net 2.9.4g at https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g/src – L.B Sep 24 '11 at 09:09

One way is to use the Highlighter. Another is to mimic what the Highlighter does: rewrite your query by calling myQuery.rewrite() with an appropriate rewriter, which is probably closer in spirit to what you were trying. This rewrites the query into a BooleanQuery containing all the matching Terms, and you can get the words out of those pretty easily. Is that enough to get you going?

Here's the idea I had in mind; sorry about the confusion re: rewriting queries; it's not really relevant here.

// reader, docId, field and analyzer are supplied by the caller
TokenStream tokens = TokenSources.getAnyTokenStream(reader, docId, field, analyzer);
CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
  // do something with termAtt, which holds the current term of this document
}
Mike Sokolov
  • Actually, what I am looking for is a way to get only those terms that are found in the document. If one document contains microscope and another contains microsoft, then on the first document I should get only microscope, and on the second document I should get only microsoft as the matched term. Your suggestion would give me all the terms that would match micro* in the index field. I hope this explains what I am looking for. – Puneet Sep 21 '11 at 12:29