Forgive me in advance if my terminology sounds a bit vague, but I am trying to explain my problem in plain English.
Let's say I have 10 sets of documents and for each set I have calculated the cosine similarity matrix based on the term frequency matrix of the set.
In R, my list of cosine similarity matrices can be simulated like this:
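cosine_simil_mat <- list()
for (i in 1:10) {
  m <- matrix(rnorm(n = 100, mean = 0.89, sd = 0.2), ncol = 10)
  m <- pmin(pmax((m + t(m)) / 2, 0), 1)  # make symmetric and keep values in [0, 1]
  diag(m) <- 1                           # each document is identical to itself
  cosine_simil_mat[[i]] <- m
}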
Now, for each matrix, I can visualise the distances separating the documents on a plane with:
mds <- cmdscale(1-cosine_simil_mat[[1]], eig=TRUE, k=2)
x <- mds$points[,1]
y <- mds$points[,2]
(Note the "1 -" part: it makes more sense to visualise the dissimilarities (or distances), not the similarities.)
I can plot my mds with:
# The first document is alpha; all the others are betas
n_docs <- nrow(cosine_simil_mat[[1]])
mds_df <- data.frame(x, y, type = c("alpha", rep("betas", n_docs - 1)))
library(ggplot2)
ggplot(mds_df, aes(x, y)) +
  geom_point(aes(colour = type)) +
  geom_density2d() +
  theme_bw()
which plots the documents as points coloured by type, overlaid with 2D density contours.
Now, what I want to do is to understand how my document of interest (alpha), which is always the first row/column of the cosine similarity matrix, behaves in the different sets. Specifically, I want to measure the distance of my document alpha from the densest part of each plot, in order to understand whether alpha is at the core of each set (measured in terms of relative term frequencies) or at the periphery, and whether its position changes across the different sets.
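To make the idea concrete, here is a minimal sketch of the kind of measure I have in mind. It rests on assumptions of mine: the "densest part" is taken to be the peak of a 2D kernel density estimate computed with MASS::kde2d on the MDS coordinates, and the helper name dist_to_mode and the toy term-frequency data are made up for illustration. It is not meant as an established statistic.

```r
library(MASS)  # provides kde2d()

# Hypothetical helper: Euclidean distance, in the 2D MDS plane, from the
# first document (alpha) to the peak of a kernel density estimate.
dist_to_mode <- function(sim, n_grid = 100) {
  pts  <- cmdscale(1 - sim, k = 2)                 # 2D MDS coordinates
  dens <- kde2d(pts[, 1], pts[, 2], n = n_grid)    # density on a grid
  # Grid cell with the highest density (rows index x, columns index y)
  peak    <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
  mode_xy <- c(dens$x[peak[1]], dens$y[peak[2]])
  # Distance of every document to that peak
  d_all <- sqrt(rowSums(sweep(pts, 2, mode_xy)^2))
  # Raw distance for alpha and its rank among all documents (1 = closest)
  c(distance = d_all[[1]], rank = rank(d_all)[[1]])
}

# Toy data (assumption): 10 documents, 50 terms, Poisson term counts
set.seed(42)
tf    <- matrix(rpois(10 * 50, lambda = 3), nrow = 10)
norms <- sqrt(rowSums(tf^2))
sim   <- (tf %*% t(tf)) / outer(norms, norms)      # cosine similarity

res <- dist_to_mode(sim)
print(res)
```

With the list above, sapply(cosine_simil_mat, dist_to_mode) would give one column per set. Since the MDS axes are scaled differently in each set, the rank is probably easier to compare across sets than the raw distance.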
Does any statistic capture this distance from the densest part of the plot? Does it make any sense?