
From a large population of documents, I would like to find those that are similar to a predefined set of documents.

All documents in the set are similar to each other, but very few documents in the population are similar to those in the set, so the situation is highly unbalanced.

As a first step, I will calculate the cosine similarity between every document in the population and every document in the set. Then, for each document in the population, I can extract features such as the maximum cosine similarity, the average of the top 10 cosine similarities, the number of documents from the set with similarity greater than ...
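To make the feature step concrete, here is a minimal sketch of how those per-document features could be computed. It assumes each document is already represented as a non-zero numeric vector (for example TF-IDF), one row per document. The function names, the `k` parameter, and the `threshold` value are placeholders of mine; the threshold in particular would have to be chosen for the actual data.

```python
import numpy as np

def cosine_similarity_matrix(population, reference):
    """Cosine similarity of every population row against every reference row.

    Assumes no row is an all-zero vector (the norm would be zero).
    """
    p = population / np.linalg.norm(population, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return p @ r.T  # shape: (n_population, n_reference)

def similarity_features(sim, k=10, threshold=0.5):
    """Per-document features from a (n_population, n_reference) similarity matrix.

    `k` and `threshold` are illustrative defaults, not recommended values.
    """
    k = min(k, sim.shape[1])               # the set may hold fewer than k docs
    top_k = np.sort(sim, axis=1)[:, -k:]   # k largest similarities in each row
    return {
        "max_sim": sim.max(axis=1),
        "top_k_mean": top_k.mean(axis=1),
        "n_above": (sim > threshold).sum(axis=1),
    }
```

Each value in the returned dictionary is an array with one entry per population document, so the features can be stacked into a matrix or used directly for sorting.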

But which approach should I use after that? Which model?

  • It doesn't seem like a classical classification problem, as I don't have labels. Maybe I can mark all documents from the set as class A and the rest as class B.

  • I could also try to rank all candidates, but there is more than one feature to rank by.

  • Clustering algorithms? But I don't have absolute coordinates in a space; I only have similarities, that is, relative distances between every pair of documents. Is there a clustering algorithm that can handle this?

I have an idea for how to validate the model: I can take some of the documents from the set, mix them into the population, and check how many of them the model's predictions recover.
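A rough sketch of that validation idea, under a few assumptions of mine: documents are non-zero vectors (one row each), the set contains at least two documents, the "model" is simply the maximum cosine similarity to the remaining set, and success is measured as recall among the top `n` ranked candidates. The function names and the `holdout_fraction` and `seed` parameters are hypothetical.

```python
import numpy as np

def recall_at_n(scores, is_hidden, n):
    """Fraction of hidden set documents found among the n highest-scored rows."""
    top = np.argsort(-scores)[:n]
    return is_hidden[top].sum() / is_hidden.sum()

def holdout_recall(population, reference, n, holdout_fraction=0.3, seed=0):
    """Hide part of the set inside the population and measure how much is recovered."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(reference))
    n_hidden = max(1, int(round(holdout_fraction * len(reference))))
    hidden, kept = reference[idx[:n_hidden]], reference[idx[n_hidden:]]

    # Mix the hidden documents into the population and remember where they are.
    mixed = np.vstack([population, hidden])
    is_hidden = np.r_[np.zeros(len(population), dtype=bool),
                      np.ones(n_hidden, dtype=bool)]

    # Placeholder scoring: maximum cosine similarity to the documents still in the set.
    m = mixed / np.linalg.norm(mixed, axis=1, keepdims=True)
    k = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    scores = (m @ k.T).max(axis=1)
    return recall_at_n(scores, is_hidden, n)
```

Any other scoring function could be swapped in for the max-similarity line. Repeating the split with several seeds would give a less noisy estimate than a single run.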

user3757753
  • This is a design question; it would be better asked on https://datascience.stackexchange.com/. Short answer: the ranking option is the most appropriate. – Erwan Nov 22 '22 at 00:34

0 Answers