As someone new to NLP, I am trying to find a solution to a problem that doesn't seem to be well documented - estimating the degree of similarity for a group of documents, as opposed to a pair of documents.
Say that I have two groups of words, a and b, and I want to be able to claim that the words within one group are more similar to each other, as a whole, than the words within the second group. To use a simple example:
library(stringdist)

a = c('foo', 'bar', 'baz', 'li')
b = c('foo', 'food', 'fo', 'fod')

# all possible pairs of words within each group
a_pairs = as.data.frame(t(combn(a, 2)))
b_pairs = as.data.frame(t(combn(b, 2)))

a_distances = c()
for (i in 1:nrow(a_pairs)){
  cos_dist = stringdist(a_pairs$V1[i], a_pairs$V2[i], method = "cosine")
  a_distances = c(a_distances, cos_dist)
}
print(mean(a_distances))
[1] 0.8888889

b_distances = c()
for (i in 1:nrow(b_pairs)){
  cos_dist = stringdist(b_pairs$V1[i], b_pairs$V2[i], method = "cosine")
  b_distances = c(b_distances, cos_dist)
}
print(mean(b_distances))
[1] 0.1230863
Here, I am using the cosine distance (method = "cosine" in stringdist, computed on character q-grams; 0 = identical, 1 = completely dissimilar), applied to all possible pairs of words within a group.
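For what it's worth, the same quantity can be computed more compactly with stringdistmatrix() from the stringdist package, which returns the full pairwise distance matrix in one call; averaging its upper triangle gives the same mean as the loops above. The helper name mean_within_dist is just something I made up for this sketch:

mean_within_dist <- function(words, method = "cosine") {
  # pairwise distance matrix for all words in the group
  d <- stringdistmatrix(words, words, method = method)
  # average over unique unordered pairs (upper triangle, diagonal excluded)
  mean(d[upper.tri(d)])
}

mean_within_dist(c('foo', 'bar', 'baz', 'li'))    # ~0.889
mean_within_dist(c('foo', 'food', 'fo', 'fod'))   # ~0.123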
For those who are more experienced in NLP and string distance functions: does it make sense to use the mean cosine distance over all pairs of documents as a measure of within-group similarity?