
I need to convert an index generated by Apache Lucene into another collection representation.

I currently have a collection of documents with many attributes.

I need to create document pairs with similarity measures from it, in order to pass them to classifiers.

Do you know of any tutorial I could follow to do this?

Thanks.

aneuryzm

1 Answer


The similarity measures need to be based on a query, i.e. you query your Lucene document set and get back a set of documents, each with a score relative to that query.

If you want to compare every document with every other document (is that right? It's hard to tell from the question), then you need to use a feature of each document as the basis for the queries.

For example, you could extract the top N terms (by frequency, excluding stop words) from each document. If you have X documents, you will have X queries. You then execute each of the X queries against the index, which gives you the relative similarity of each document to every other document. The result is an X-by-X matrix you could use for classification.
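As a rough illustration of the "top N terms" step, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API; the class name `TopTermsSketch`, the helper names, and the tiny stop-word list are all hypothetical and only there for illustration. It also assumes the resulting string is later handed to a query parser that ORs whitespace-separated terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: picks the most frequent terms of a document.
final class TopTermsSketch {
    // Tiny illustrative stop-word list (an assumption, not Lucene's own list).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "in", "on", "to", "is");

    /** Returns up to n most frequent non-stop-word terms; ties are broken alphabetically. */
    static List<String> topTerms(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase and split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending frequency, then by term for a deterministic order.
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    /** Joins the terms with spaces, to be used as the text of one query per document. */
    static String toQueryString(List<String> terms) {
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        List<String> terms = topTerms("the cat and the dog chased the cat", 2);
        System.out.println(toQueryString(terms));
    }
}
```

You would call `topTerms` once per document, run the resulting query against the index, and store the scores of the hits as one row of the matrix.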

Alternatively, you could use the title or synopsis of each document as the basis for the query (again, excluding stop words).

Joel
  • Thanks, you understood perfectly what I meant. So, should I run a query for each document? Afterwards I will save the results in a structured file to pass to a classifier. – aneuryzm Feb 24 '11 at 13:32
  • I actually already have structured XML input with a description, tags, and geolocation information for each document. For the description I will use tf-idf cosine similarity; for the geotags I need to implement Haversine similarity. I don't know exactly how to integrate such similarity metrics. I will use only tf-idf for now, which should already be implemented in Lucene. If you know of any tutorial, it would be very welcome, since I don't have experience with Lucene. – aneuryzm Feb 24 '11 at 13:34
  • Yes, the default scoring function in Lucene uses tf-idf and cosine similarity, so you can probably use it out of the box. You can customise it, though. See http://lucene.apache.org/java/2_4_0/scoring.html and also http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html – Joel Feb 24 '11 at 15:00
  • OK, thanks. What about complex queries? I'm passing a document as the query. This means I have several fields with text descriptions that I should process differently (e.g. stop-word removal on only a few of them), plus a numerical field with geo-coordinates. I need to package all of this into my query. – aneuryzm Feb 25 '11 at 10:37
  • Also, should I use MatchAllDocsQuery to get the similarity values against all documents in the collection? – aneuryzm Feb 25 '11 at 18:33
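
For the geotag part mentioned in the comments, the Haversine distance can be computed outside Lucene and added to the feature set for each document pair. The following is only a sketch under stated assumptions: the class name `HaversineSketch`, the mean Earth radius of 6371 km, and in particular the mapping from distance to a similarity in (0, 1] via a `scaleKm` parameter are illustrative choices, not something prescribed by Lucene or by the thread.

```java
// Hypothetical helper, not part of Lucene: great-circle distance between two geotags.
final class HaversineSketch {
    // Mean Earth radius in kilometres (a common approximation).
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Haversine distance in kilometres between two points given in decimal degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        // Clamp to 1.0 to guard against rounding slightly above the valid asin range.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /**
     * Illustrative similarity in (0, 1]: 1 for identical points, decreasing with distance.
     * scaleKm is an assumed tuning parameter (the distance at which similarity is 0.5).
     */
    static double similarity(double lat1, double lon1, double lat2, double lon2, double scaleKm) {
        return 1.0 / (1.0 + distanceKm(lat1, lon1, lat2, lon2) / scaleKm);
    }

    public static void main(String[] args) {
        // Two arbitrary example coordinates in decimal degrees.
        System.out.println(distanceKm(48.8566, 2.3522, 51.5074, -0.1278));
        System.out.println(similarity(48.8566, 2.3522, 51.5074, -0.1278, 100.0));
    }
}
```

One simple way to use this is to keep the Lucene text score and the geo similarity as separate columns for each document pair and let the classifier weight them, rather than trying to fold both into a single Lucene query.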