I'm trying to use Carrot2 for result set clustering, and I have a couple of questions about it.

a) Can we cluster documents in Solr/Lucene based on specific Solr fields? For example, could we cluster them on name, person name, and geo-distance location (lat, long), with a weight for each field?

b) My use case for clustering is not really online; it is more of a batch job. Given that, does the restriction to a maximum of 1K results still apply?

Ganesh

1 Answer

Carrot2 performs clustering based only on the natural text of your documents. Person names would probably be too short for meaningful clustering; Carrot2 is not suitable for geo-distance and other numerical data.

The 1k restriction / recommendation is based on the design goal of Carrot2: to cluster small collections of texts (such as search results) fast enough that the process can be done online. Carrot2 does well for collections of around 1k documents, but will not scale very well beyond several thousand documents.
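
As a rough illustration of how this plays out on the Solr side, here is a minimal sketch that builds the query string for a clustering request. It assumes the parameter names of the Solr 4.x clustering component (`clustering`, `clustering.results`, `carrot.title`, `carrot.snippet`); other Solr versions may name them differently, so check the reference guide for yours. The helper `build_clustering_query` and the field names `name` and `description` are hypothetical.

```python
from urllib.parse import urlencode


def build_clustering_query(query, title_field, snippet_field, rows=1000):
    """Build the query string for a Solr clustering request.

    Assumption: parameter names follow the Solr 4.x clustering component.
    The fields passed here are treated as natural text, not as numbers.
    """
    params = {
        "q": query,
        # Keep the result set small; Carrot2 targets roughly 1k documents.
        "rows": rows,
        "clustering": "true",
        "clustering.results": "true",
        # Text fields whose content is handed to Carrot2.
        "carrot.title": title_field,
        "carrot.snippet": snippet_field,
    }
    return urlencode(params)


# Hypothetical field names, for illustration only.
query_string = build_clustering_query("laptop", "name", "description", rows=500)
print(query_string)
```

The resulting string would be appended to the URL of whichever request handler has the clustering component enabled. Only text fields are mapped here; there is no place in this request to pass a numeric field or a per-field weight to the clustering algorithm.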

Stanislaw Osinski
  • Thanks. In Solr I can run a query and get a score that combines multiple fields, weights, and geo distance. If we could use this score as a distance measure for clustering, that would be good. If there is no such option now, is it part of the Carrot2 vision? – Ganesh Jan 06 '14 at 21:00
  • Also, can you please let me know what "natural text" means? I believe it is any field in Solr (single or composite/copy fields). – Ganesh Jan 06 '14 at 21:19
  • Carrot2 was designed specifically for clustering natural text, such as web page content, news articles, scientific papers etc. It doesn't internally use the classic clustering algorithms that rely on distance measures, so it won't work for numeric data. We don't plan to add numeric clustering to Carrot2 because there are a lot of other open source projects that do that very well. – Stanislaw Osinski Jan 07 '14 at 21:09
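
For readers wondering what distance-based clustering of the lat/long data might look like outside Carrot2, here is a minimal standard-library sketch. It is not Carrot2 code and not any particular project's API: it pairs the haversine great-circle distance with a simple greedy "leader" clustering, and the function names, the sample coordinates, and the 50 km radius are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def leader_cluster(points, radius_km):
    """Greedy leader clustering: a point joins the first cluster whose
    leader is within radius_km, otherwise it starts a new cluster."""
    clusters = []  # list of (leader, members) pairs
    for p in points:
        for leader, members in clusters:
            if haversine_km(leader, p) <= radius_km:
                members.append(p)
                break
        else:
            # No existing leader was close enough.
            clusters.append((p, [p]))
    return [members for _, members in clusters]


# Illustrative coordinates: two near Berlin, two near Paris.
points = [(52.52, 13.40), (52.50, 13.42), (48.85, 2.35), (48.86, 2.33)]
clusters = leader_cluster(points, radius_km=50)
print(clusters)
```

Leader clustering is order-dependent and quite crude; a dedicated numeric clustering library would normally be a better fit for a batch job. The point is only that this kind of clustering operates on a distance measure, which is exactly what Carrot2 does not use.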