I collected a list of abstracts from online news websites and manually labelled them by topic, using their original labels (e.g., politics, entertainment, sports, finance, etc.). Now I want to compare the similarity in word usage between abstracts from any two topics (say, abstracts labelled "politics" vs. those labelled "finance"); however, the number of news abstracts under each topic differs, and so does the word length of individual abstracts, which makes calculating a document-by-document cosine similarity difficult.
So what I did was follow the text2vec vignette: I split my example data by topic, parsed and stemmed the text, vectorized the tokens in each abstract (i.e., each row entry), and built the dtm to create the vector space for comparison.
While the methods in the text2vec vignette are straightforward, the outputs come as document-by-document matrices. I am wondering whether there is any way to get a single similarity measure (say, something in [0, 1] or (-1, 1)) between any two sets of documents labelled under two different topics.
My current code is provided below, along with a small 9-row set of news abstracts falling under 3 distinct topics (note that the number of documents per topic and their word lengths all differ: the topic "sports" has two entries, "politics" has four, and "finance" has three). Don't expect meaningful similarity results from such small data; it only serves as an example.
It would be really appreciated if someone could point out how to modify my existing code to get a single pair-wise similarity measure between any two topics.
# load required packages
library(foreign)
library(stringr)
library(text2vec)
news <- read.csv("https://www.dropbox.com/s/rikduji15mr5o89/news.csv?dl=1")
names(news)[1] <- "text"
news$text <- as.character(news$text)
names(news)[2] <- "topic"
news$topic <- as.character(news$topic)
news$topic <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
prep_fun = function(x) {
  x %>%
    # make text lower case
    str_to_lower %>%
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
news$text_clean = prep_fun(news$text)
df <- news[c("topic", "text_clean")]
doc_set_1 <- df[which(df$topic==1), ]
doc_set_2 <- df[which(df$topic==2), ]
doc_set_3 <- df[which(df$topic==3), ]
it1 = itoken(doc_set_1$text_clean, progressbar = FALSE)
it2 = itoken(doc_set_2$text_clean, progressbar = FALSE)
it3 = itoken(doc_set_3$text_clean, progressbar = FALSE)
it = itoken(df$text_clean, progressbar = FALSE)
v = create_vocabulary(it)
# %>% prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)
dtm3 = create_dtm(it3, vectorizer)
# calculate jaccard similarity
d1_d2_jac_sim = sim2(dtm1, dtm2, method = "jaccard", norm = "none")
d2_d3_jac_sim = sim2(dtm2, dtm3, method = "jaccard", norm = "none")
d1_d3_jac_sim = sim2(dtm1, dtm3, method = "jaccard", norm = "none")
# calculate cosine similarity
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
d2_d3_cos_sim = sim2(dtm2, dtm3, method = "cosine", norm = "l2")
d1_d3_cos_sim = sim2(dtm1, dtm3, method = "cosine", norm = "l2")
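# (sketch, not sure it is the right summary: one way to collapse each
#  document-by-document cosine matrix above into a single topic-to-topic
#  number is to average all its entries; the "*_single_cos" names below
#  are just made up for illustration)
d1_d2_single_cos = mean(as.matrix(d1_d2_cos_sim))
d2_d3_single_cos = mean(as.matrix(d2_d3_cos_sim))
d1_d3_single_cos = mean(as.matrix(d1_d3_cos_sim))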
# calculate cosine similarity on the tf-idf-weighted dtm
dtm = create_dtm(it, vectorizer)
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)
# this computes cosine similarity among all 9 documents at once,
# not between any particular pair of topics
all_tfidf_cos_sim = sim2(x = dtm_tfidf, method = "cosine", norm = "l2")
# any way to get tfidf_cos_sim for (d1, d2), (d1, d3), (d2, d3) separately?
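# (sketch of a possible workaround, assuming the rows of dtm_tfidf follow the
#  row order of df so that df$topic can be used to index them; the
#  "dtm_tfidf_*" and "*_tfidf_cos_sim" names are made up for illustration)
dtm_tfidf_1 = dtm_tfidf[df$topic == 1, , drop = FALSE]
dtm_tfidf_2 = dtm_tfidf[df$topic == 2, , drop = FALSE]
dtm_tfidf_3 = dtm_tfidf[df$topic == 3, , drop = FALSE]
d1_d2_tfidf_cos_sim = sim2(dtm_tfidf_1, dtm_tfidf_2, method = "cosine", norm = "l2")
d1_d3_tfidf_cos_sim = sim2(dtm_tfidf_1, dtm_tfidf_3, method = "cosine", norm = "l2")
d2_d3_tfidf_cos_sim = sim2(dtm_tfidf_2, dtm_tfidf_3, method = "cosine", norm = "l2")
# these could then be collapsed to a single number the same way as above,
# e.g. mean(as.matrix(d1_d3_tfidf_cos_sim))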