As someone new to NLP, I am trying to find a solution to a problem that doesn't seem to be well documented - estimating the degree of similarity for a group of documents, as opposed to a pair of documents.
Say that I have two groups of words, a and b, and I want to be able to claim that the words within one group are more similar to each other, as a whole, than the words within the second group. To use a simple example:
library(stringdist)

a = c('foo', 'bar', 'baz', 'li')
b = c('foo', 'food', 'fo', 'fod')

# all possible pairs of words within each group
a_pairs = as.data.frame(t(combn(a, 2)))
b_pairs = as.data.frame(t(combn(b, 2)))

a_distances = c()
for (i in 1:nrow(a_pairs)){
  cos_dist = stringdist(a_pairs$V1[i], a_pairs$V2[i], method = "cosine")
  a_distances = c(a_distances, cos_dist)
}
print(mean(a_distances))
[1] 0.8888889

b_distances = c()
for (i in 1:nrow(b_pairs)){
  cos_dist = stringdist(b_pairs$V1[i], b_pairs$V2[i], method = "cosine")
  b_distances = c(b_distances, cos_dist)
}
print(mean(b_distances))
[1] 0.1230863
Here, I am using the cosine distance (method = "cosine" in stringdist, computed on character q-grams; 0 = identical, 1 = completely dissimilar), applied to all possible pairs of words within a group.
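For what it's worth, the same quantity can be computed more compactly with stringdistmatrix() from the stringdist package, which returns the full pairwise distance matrix in one call; averaging its upper triangle gives the same mean as the loops above. The helper name mean_within_dist is just something I made up for this sketch:

mean_within_dist <- function(words, method = "cosine") {
  # pairwise distance matrix for all words in the group
  d <- stringdistmatrix(words, words, method = method)
  # average over unique unordered pairs (upper triangle, diagonal excluded)
  mean(d[upper.tri(d)])
}

mean_within_dist(c('foo', 'bar', 'baz', 'li'))    # ~0.889
mean_within_dist(c('foo', 'food', 'fo', 'fod'))   # ~0.123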
For those who are more experienced in NLP and string distance functions: does it make sense to use the mean cosine distance over all pairs of documents as a measure of within-group similarity?