0

I want to cluster my indexed data in Solr. Each Solr document contains the following fields: id, title, url.

I have read the Solr 7.7 docs, and the clustering algorithm described there applies only to the search results of a single query. What I need is clustering of the full index, based on the document title.

Can anyone help?

Soufiane Roui

2 Answers

2

As far as I'm aware, there's no out-of-the-box plugin for clustering the whole Solr index.

If you have some background in machine learning, have a look at Apache Mahout; it should be suitable for clustering a dataset of this size. Alternatively, there's a commercially licensed Carrot2 spin-off we develop called Lingo4G, which is designed for clustering large collections of text. In both cases, however, there is no direct integration with Solr -- you'd need to handle the integration on your own.
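As a rough illustration of the "integration on your own" part: whichever external tool you pick, the first step is usually getting every title out of the index. The sketch below is one possible way to do that with Solr's cursor-based deep paging (`cursorMark`), using only the Python standard library. The function names `fetch_page` and `export_titles` are made up for this example, and it assumes `id` is the collection's uniqueKey and that `title` is a stored field.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(base_url, params):
    """GET one page from a Solr /select handler and decode the JSON body."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def export_titles(base_url, rows=1000, fetch=fetch_page):
    """Yield (id, title) for every document, paging with cursorMark.

    `fetch` is injectable so the paging loop can be exercised without a
    running Solr instance.
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": "*:*",
            "fl": "id,title",
            # Cursor paging needs a sort that includes the uniqueKey field
            # (assumed here to be `id`).
            "sort": "id asc",
            "rows": rows,
            "wt": "json",
            "cursorMark": cursor,
        }
        data = fetch(base_url, params)
        for doc in data["response"]["docs"]:
            # `title` may be a list if the field is multi-valued.
            yield doc["id"], doc.get("title")
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            break
        cursor = next_cursor
```

You would call it with the URL of your own select handler, e.g. `export_titles("http://localhost:8983/solr/mycore/select")` (host and core name are placeholders), write the pairs to a file, and feed that file to the clustering tool of your choice.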

Stanislaw Osinski
  • There is a configuration parameter that makes the clustering component in Solr cluster all the indexed data. The parameter name is `clustering.collection`, but the [docs](https://lucene.apache.org/solr/guide/6_6/result-clustering.html#ResultClustering-ConfigurationParametersoftheClusteringComponent) don't give enough information about the next steps after setting this parameter to true. – Soufiane Roui Jan 02 '21 at 00:22
  • The `clustering.collection` parameter is effectively dead; there is no Solr plugin that can cluster the whole collection. The search results clustering plugin (Carrot2) only clusters search results. You can increase the number of result rows you retrieve in order to cluster a larger number of results, but search results clustering is performed in memory, so it won't be able to handle more than ~10k results. Like I said in the main answer, to cluster 500k docs, you'd need to use external tools. – Stanislaw Osinski Jan 02 '21 at 10:22
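
To make the "increase the number of rows" suggestion concrete, here is a small sketch that builds such a request URL. It assumes a request handler with the clustering component configured (the `/clustering` path in the usage note is an assumption, taken from Solr's example configs) and uses the `clustering` and `clustering.results` parameters from the Solr 7 result clustering docs. The helper name `clustering_url` is made up for this example.

```python
import urllib.parse


def clustering_url(base_url, query="*:*", rows=1000):
    """Build a URL that asks Solr to cluster the top `rows` results of `query`."""
    params = {
        "q": query,
        "rows": rows,  # clustering only sees the rows actually returned
        "fl": "id,title",
        "clustering": "true",
        "clustering.results": "true",
        "wt": "json",
    }
    return base_url + "?" + urllib.parse.urlencode(params)
```

For example, `clustering_url("http://localhost:8983/solr/mycore/clustering", rows=5000)` (host and core name are placeholders) requests clusters over the first 5000 matches. Keep `rows` well below the ~10k limit mentioned above.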
0

Results clustering was removed in Solr 8.x. The reason cited on the Solr website was “The search results clustering contrib (Carrot2) has been removed from 8.x Solr due to lack of Java 1.8 compatibility in the dependency that provides online clustering of search results.”

Here is how I got it to work on JVM 11. All necessary files can be downloaded from this GitHub repo!

  1. Follow the instructions for installing the clustering contrib: https://solr.apache.org/guide/8_1/result-clustering.html
  2. Add solr-clustering-8.7.0.jar to the /solr-8.x.x/dist directory (I tested this jar up to Solr version 8.11.1).
  3. Create the /solr-8.x.x/contrib/clustering directory and copy in the files marked for contrib.
  4. Restart Solr.

Tested with Java 11.

rscavilla