Calculate distance from densest part of cosine similarity 2d distribution

Question

Forgive me in advance if my terminology sounds a bit vague, but I am trying to explain my problem in plain English.

Let's say I have 10 sets of documents and for each set I have calculated the cosine similarity matrix based on the term frequency matrix of the set.
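For concreteness, here is a minimal sketch of how one such cosine similarity matrix could be computed from a term frequency matrix. The matrix `tf` and its values are made up for illustration (documents in rows, terms in columns); they are not my real data.

```r
# Hypothetical term frequency matrix: 3 documents (rows) x 4 terms (columns)
tf <- matrix(c(2, 0, 1, 3,
               1, 1, 0, 2,
               0, 4, 1, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("alpha", "beta1", "beta2"), NULL))

# Cosine similarity: dot product of each pair of rows,
# divided by the product of the two row norms
row_norms <- sqrt(rowSums(tf^2))
cos_sim <- tcrossprod(tf) / outer(row_norms, row_norms)
```

The result is symmetric with ones on the diagonal, and for non-negative term frequencies every entry lies between 0 and 1.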

In R, my list of cosine similarity matrices can be simulated like this:

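# Simulate 10 matrices of 10 x 10 "cosine similarities".
# These are random draws, so unlike real cosine similarity matrices
# they are neither symmetric nor bounded by 1.
cosine_simil_mat <- list()
for (i in 1:10) {
  cosine_simil_mat[[i]] <-
    matrix(rnorm(n = 100, mean = 0.89, sd = 0.2), ncol = 10)
}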

Now, for each matrix, I can project the documents onto a plane with classical multidimensional scaling, so that the distances separating them can be visualised:

mds <- cmdscale(1-cosine_simil_mat[[1]], eig=TRUE, k=2)
x <- mds$points[,1]
y <- mds$points[,2]

(Note the `1 -` part: it makes more sense to visualise the dissimilarities, or distances, than the similarities.)

I can plot my mds with

mds_df <- data.frame(
  x, y,
  # the first document is alpha; all remaining ones are betas
  type = c("alpha", rep("betas", nrow(cosine_simil_mat[[1]]) - 1))
)

library(ggplot2)
ggplot(mds_df, aes(x, y)) +
  geom_point(aes(colour = type)) +
  geom_density2d() +
  theme_bw()

which plots the documents as points coloured by type, with 2D density contours overlaid.

Now, what I want to understand is how my document of interest (alpha), which is always the first row/column of the cosine similarity matrix, behaves in the different sets. Specifically, I want to measure the distance of alpha from the densest part of each plot. This would tell me whether alpha sits at the core of a set (in terms of relative term frequencies) or at its periphery, and whether its position changes from one set to another.
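To make the idea concrete, here is a rough sketch of one possible way to measure it, not an established statistic: estimate the 2D density on a grid with `MASS::kde2d()` (the estimator that `geom_density2d()` relies on), take the grid point with the highest density as the "densest part", and compute the Euclidean distance from alpha to it. The seed, the symmetrised stand-in matrix `s`, the default bandwidth, and the grid size are all assumptions made for this illustration.

```r
library(MASS)  # provides kde2d()

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in for one cosine similarity matrix (assumption):
# symmetrised, clipped to [0, 1], with a unit diagonal
s <- matrix(rnorm(100, mean = 0.89, sd = 0.2), ncol = 10)
s <- pmin(pmax((s + t(s)) / 2, 0), 1)
diag(s) <- 1

# Classical MDS on the dissimilarities; row 1 is alpha
pts <- cmdscale(1 - s, k = 2)

# 2D kernel density estimate over all documents (alpha included),
# using the default bandwidth and a 100 x 100 grid
dens <- kde2d(pts[, 1], pts[, 2], n = 100)

# Grid point with the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
mode_xy <- unname(c(dens$x[peak["row"]], dens$y[peak["col"]]))

# Euclidean distance from alpha to that point
alpha_dist <- sqrt(sum((pts[1, ] - mode_xy)^2))

# Distances of all documents to the same point, and alpha's rank among them
all_dist <- sqrt(rowSums(sweep(pts, 2, mode_xy)^2))
alpha_rank <- rank(all_dist)[1]
```

The raw distance depends on the scale of each MDS solution, so the rank of alpha among all documents (1 = closest to the densest point) may be easier to compare across sets. With so few points per set the estimated mode is sensitive to the bandwidth, which is part of why I am unsure the approach is sound.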

Does any statistic capture this distance from the densest part of the plot? Does it make any sense?

Also, you should `set.seed()` to make your example fully reproducible. — , Sep 16 '15 at 09:23

