In my Solr index I have the following structure: parent documents represent people's names (each a dictionary). Each parent document also contains nested child documents (a nested list of dictionaries), which are the documents that match that person's name.

When I try to cluster this information in a way that makes sense, I can only cluster the child documents directly, which results in clusters of keywords that belong to those texts.

Ideally, I would like to cluster people (parent documents) by the similarity of their nested child documents. So rather than having keywords from texts clustered together, I would like to cluster the names of people whose child documents have similar content.

E.g. suppose the profiles of Bob, John, and Lewis all have child documents that contain the text "We are highly skilled in Python", and the profiles of Dan, Maria, and Chris have child documents that contain the text "We are highly skilled in Java". I would like one cluster of (Bob, John, Lewis) and another cluster of (Dan, Maria, Chris). Then, when we click on the first cluster, we get the result "We are highly skilled in Python", and for the second cluster, we get the result "We are highly skilled in Java".

Is there a way to reproduce such a structure in Carrot2 Workbench?

blah

1 Answer

Unfortunately not. This is a pretty specific scenario and we aim to keep Workbench a generic tool with Solr being one of many document sources.

For this kind of parent-child clustering, you'd need to use the Carrot2 Java or REST API directly:

  1. Fetch child documents from Solr, cluster them in Carrot2.
  2. For each cluster C:
    • create a new cluster CC with the same label as cluster C,
    • for each child document D in cluster C, take the child's parent document P and put the parent in cluster CC,
    • put cluster CC in the set of parent clusters.

As a result of the above procedure, you'll have a set of clusters containing parent documents, clustered by the textual content of their child documents.
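Step 2 of the procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not Carrot2 or Solr code: the function name `build_parent_clusters`, the shape of the cluster dictionaries (`label`, `documents`), and the `parent_of` mapping are all hypothetical, and you would need to adapt them to whatever your clustering response and Solr query actually return. The sketch also assumes that a parent should appear only once per cluster, even if several of its children fall into that cluster.

```python
def build_parent_clusters(child_clusters, parent_of):
    """Turn clusters of child documents into clusters of parent documents.

    child_clusters: list of dicts with a "label" and a list of child
                    document ids under "documents" (hypothetical shape).
    parent_of:      dict mapping each child document id to its parent id,
                    built from the parent-child relationships in Solr.
    """
    parent_clusters = []
    for cluster in child_clusters:
        parents = []
        seen = set()
        for child_id in cluster["documents"]:
            parent_id = parent_of[child_id]
            # Assumption: list each parent once per cluster, in first-seen order.
            if parent_id not in seen:
                seen.add(parent_id)
                parents.append(parent_id)
        # The parent cluster reuses the label of the child cluster.
        parent_clusters.append({"label": cluster["label"], "parents": parents})
    return parent_clusters


# Made-up data mirroring the example in the question.
child_clusters = [
    {"label": "Python", "documents": ["c1", "c2", "c3", "c4"]},
    {"label": "Java", "documents": ["c5", "c6", "c7"]},
]
parent_of = {
    "c1": "Bob", "c2": "John", "c3": "Lewis", "c4": "Bob",
    "c5": "Dan", "c6": "Maria", "c7": "Chris",
}

result = build_parent_clusters(child_clusters, parent_of)
for parent_cluster in result:
    print(parent_cluster["label"], parent_cluster["parents"])
```

Because this step works on plain ids and labels, it can run in whatever environment sends the query, with no Java required.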

Stanislaw Osinski
  • It's nice to know that there is a way to get the clustering structure I want. Would you be able to provide an example or clearer instructions for your statement: "cluster the child documents, build the parent clusters based on the parent-child document relationships obtained from Solr" – blah Jan 05 '21 at 11:41
  • See the edited response for some more details. – Stanislaw Osinski Jan 06 '21 at 15:43
  • Thanks. In step 2, are you creating sub-clusters, then? Would enabling subclustering in the solrconfig.xml file work for this, or are these instructions only intended for Carrot2 Java (I don't work with Java, unfortunately)? – blah Jan 07 '21 at 10:04
  • I don't think it's possible to perform the whole process inside Solr. I'm not familiar with parent-child documents in Solr, but if you can somehow select the child documents along with a parent reference for each child doc, then you can run step 1 in Solr and then post-process the results (step 2) in whatever environment you use to send the query to Solr. – Stanislaw Osinski Jan 07 '21 at 12:19
  • Would the Lingo3G hierarchical clustering enable this parent-child relation clustering? – blah Jan 08 '21 at 11:18
  • No, the hierarchy would still be generated for child documents. To get clusters of parent documents, you must postprocess the result obtained from the clustering engine. Parent-child clustering cannot be done entirely within Solr or any other Carrot2 API. – Stanislaw Osinski Jan 10 '21 at 17:53
  • Ok, post-processing is the solution then. Thanks for clarifying! – blah Jan 11 '21 at 11:48