Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs X terms and find the terms that occur in the same documents most often. In my previous projects I implemented this directly in Lucene by iterating over TermDocs (which I got by calling IndexReader.termDocs(Term)), but I can't see anything similar in Solr.
So, my needs are:
- To retrieve the most associated terms within a particular field.
- To retrieve the term that is closest to a specified one within a particular field.
I will rate answers in the following way:
- Ideally, I would like to find a Solr component that directly covers these needs, that is, something that returns associated terms directly.
- If this is not possible, I'm looking for a way to get co-occurrence matrix information for a specified field.
- If that is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.
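
To make the last option concrete, here is a minimal sketch of what I would do once I have the terms and the document ids for one field. It is plain Java with no Solr or Lucene calls. The class name, the method names, the postings map (term to document ids) and the sample data are all made up for illustration, and raw co-occurrence counts are only the simplest possible association measure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: not part of Solr or Lucene.
final class CooccurrenceSketch {

    // Made-up sample input: term -> ids of the documents that contain it
    // (the kind of data option 3 above would give me for a single field).
    static Map<String, Set<Integer>> samplePostings() {
        Map<String, Set<Integer>> postings = new TreeMap<>();
        postings.put("lucene", new TreeSet<>(List.of(1, 2, 3)));
        postings.put("solr", new TreeSet<>(List.of(2, 3, 4)));
        postings.put("index", new TreeSet<>(List.of(1, 2, 3, 4)));
        postings.put("cat", new TreeSet<>(List.of(5)));
        return postings;
    }

    // Number of documents in which both terms occur.
    static int cooccurrence(Map<String, Set<Integer>> postings, String a, String b) {
        Set<Integer> docsA = postings.getOrDefault(a, Set.of());
        Set<Integer> docsB = postings.getOrDefault(b, Set.of());
        int shared = 0;
        for (Integer doc : docsA) {
            if (docsB.contains(doc)) {
                shared++;
            }
        }
        return shared;
    }

    // Need 2: the term sharing the most documents with the target.
    // Ties go to the alphabetically first term; null if nothing co-occurs.
    static String closestTerm(Map<String, Set<Integer>> postings, String target) {
        String best = null;
        int bestCount = 0;
        for (String candidate : new TreeSet<>(postings.keySet())) {
            if (candidate.equals(target)) {
                continue;
            }
            int count = cooccurrence(postings, target, candidate);
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    // Need 1: the pair of distinct terms with the highest co-occurrence count.
    // Returns an empty list if no two terms share a document.
    static List<String> mostAssociatedPair(Map<String, Set<Integer>> postings) {
        List<String> terms = new ArrayList<>(new TreeSet<>(postings.keySet()));
        List<String> best = List.of();
        int bestCount = 0;
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                int count = cooccurrence(postings, terms.get(i), terms.get(j));
                if (count > bestCount) {
                    best = List.of(terms.get(i), terms.get(j));
                    bestCount = count;
                }
            }
        }
        return best;
    }
}
```

With the sample data, "lucene" and "solr" share two documents, and the closest term to "lucene" is "index" because it appears in all three of its documents. Raw counts favor frequent terms like this, so a normalized measure would probably be needed in practice; the all-pairs loop is also quadratic in the number of terms.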