11

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a documents × terms co-occurrence matrix and find the terms that occur in the same documents most often. In my previous projects I implemented this directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.

So, my needs are:

  1. To retrieve the most associated terms within a particular field.
  2. To retrieve the term that is closest to a specified one within a particular field.

I will rate answers in the following way:

  1. Ideally, I would like to find a Solr component that directly covers these needs, that is, something that returns associated terms directly.
  2. If this is not possible, I'm looking for a way to get co-occurrence matrix information for a specified field.
  3. If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.
ffriend
  • I googled the topic and I am awaiting the answer myself. BTW, Solr's clustering capabilities are described "as a way to group together semantically related results/documents". Not close enough, right? – Jesvin Jose Sep 14 '11 at 07:41
  • Clustering is a slightly different thing. First of all, it works with documents, not terms, so you cannot cluster terms (at least I can't see any sense in clustering terms and don't know an easy way to do it with Solr). The opposite does seem possible, though: you can use association mining over terms to cluster documents. – ffriend Sep 14 '11 at 16:37

3 Answers

3

You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout, and then use Mahout to take the singular value decomposition.

I don't know of any LSA components for Solr.

Xodarap
  • Thanks for your answer, but actually I don't need LSA - I mentioned it to demonstrate that this topic is quite popular and it's strange that Solr still doesn't have any support for such tasks. As I mentioned, I already have the code to retrieve associations directly with Lucene, so I'm interested in how to do it with Solr. – ffriend Sep 14 '11 at 22:19
  • @ffriend: I'm not sure what you're asking. Solr has the same index format as Lucene, so any code which works for Lucene will work for Solr. – Xodarap Sep 15 '11 at 00:05
  • Of course I know that Solr uses Lucene internally, and I can write a separate tool to access the same index from Lucene and get what I want. But that is inconvenient: I would have 2 separate programs, Solr and my tool, which are installed differently, invoked differently, etc. What I want is a Solr command or something like that to find associated terms. Of course, I can create a custom RequestHandler and get exactly what I need (and this is actually what I'm going to do if there's no better option), but first I want to know whether something for this task is already there. – ffriend Sep 15 '11 at 00:31
  • @ffriend: Ah, I see. My guess is that writing your own RequestHandler will be the easiest, but maybe others know more than me. – Xodarap Sep 15 '11 at 14:21
2

Since there are still no answers to my questions, I'll write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.

I'll go with the co-occurrence matrix, since it is the core of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access through Lucene. To construct the matrix we need:

  1. All terms, or at least the most frequent ones, because rare terms by their nature won't affect the result of association mining.
  2. The documents where these terms occur, or again at least the top documents.

Both tasks can easily be done with standard Solr components.

To retrieve terms, TermsComponent or faceted search may be used. We can get only the top terms (the default) or all terms (by setting the maximum number of terms to return; see the documentation of the particular feature for details).
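As a rough sketch of the TermsComponent route, the request for all terms of one field could be built like this. It assumes a `/terms` request handler is configured (as in the Solr example configuration); the base URL and field name are placeholders, and `terms_request_url` is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def terms_request_url(base_url, field, limit=-1):
    """Build a TermsComponent request URL for one field.

    Assumes a /terms handler is registered on the core at base_url.
    """
    params = {
        "terms": "true",
        "terms.fl": field,      # field whose terms we want
        "terms.limit": limit,   # a negative value removes the default cap of 10
        "terms.sort": "count",  # most frequent terms first
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


# Placeholder host, core and field names.
print(terms_request_url("http://localhost:8983/solr/someinstance", "text"))
```

Passing a positive `limit` instead returns only that many top terms, which matches the "most frequent ones" shortcut mentioned above.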

Getting the documents that contain a given term is simply a search for that term. The weak point here is that we need 1 request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides the number of occurrences of the term in each found document.

Having this, it is easy to build the co-occurrence matrix. To mine associations, it is possible to use other software such as Weka or to write your own implementation of, say, the Apriori algorithm.
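To make the matrix-building step concrete, here is a minimal sketch in Python. It assumes the term → document ids mapping has already been collected from Solr with the requests described above; the function names and the sample data are made up for illustration, and it counts plain document-level co-occurrence rather than running Apriori.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(postings):
    """Count, for every pair of terms, the documents that contain both.

    postings: dict mapping term -> iterable of document ids.
    Returns a Counter keyed by alphabetically sorted (term_a, term_b) pairs.
    """
    # Invert term -> docs into doc -> terms.
    doc_terms = {}
    for term, doc_ids in postings.items():
        for doc_id in set(doc_ids):
            doc_terms.setdefault(doc_id, set()).add(term)

    counts = Counter()
    for terms in doc_terms.values():
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return counts


def most_associated(counts, term, top_n=5):
    """Return up to top_n (other_term, shared_docs) pairs for a term."""
    scored = []
    for (a, b), n in counts.items():
        if a == term:
            scored.append((b, n))
        elif b == term:
            scored.append((a, n))
    # Highest count first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Made-up postings: term -> ids of documents containing it.
postings = {"solr": [1, 2, 3], "lucene": [1, 2], "mahout": [3], "index": [2, 3]}
counts = cooccurrence_counts(postings)
print(most_associated(counts, "solr"))
```

Calling `most_associated(counts, term, top_n=1)` gives the single closest term, which covers the second need from the question. With thousands of terms the pair counter grows quadratically in the worst case, so restricting the input to the most frequent terms is probably wise.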

ffriend
-1

You can get the number of occurrences of the current term in each found document with the following query:

http://ip:port/solr/someinstance/select?defType=func&fl=termfreq(field,xxx),*&fq={!frange l=1}termfreq(field,xxx)&indent=on&q=termfreq(field,xxx)&sort=termfreq(field,xxx) desc&wt=json
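The URL above contains spaces, braces and parentheses, so it usually needs to be percent-encoded before it is sent. A small sketch of building the same query programmatically follows; `termfreq_query_url` is a hypothetical helper, the host, core, field and term are placeholders, and it assumes the term contains no characters (commas, parentheses, quotes) that would need escaping inside the function query.

```python
from urllib.parse import urlencode


def termfreq_query_url(base_url, field, term):
    """Build a /select URL returning per-document term frequencies."""
    tf = f"termfreq({field},{term})"
    params = {
        "defType": "func",
        "q": tf,                     # score documents by term frequency
        "fq": "{!frange l=1}" + tf,  # keep documents where the term occurs at least once
        "fl": f"{tf},*",             # return the frequency alongside stored fields
        "sort": f"{tf} desc",
        "indent": "on",
        "wt": "json",
    }
    # urlencode percent-encodes braces, parentheses and commas, and turns spaces into '+'.
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host, core, field and term.
print(termfreq_query_url("http://localhost:8983/solr/someinstance", "text", "solr"))
```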
Roman Marusyk