I collected a list of abstracts from online news websites and manually labelled them by topic, using their original labels (e.g., politics, entertainment, sports, finance, etc.). Now I want to compare the similarity in word usage between abstracts from any two topics (say, abstracts labelled "politics" vs. those labelled "finance"); however, the number of news abstracts under each topic differs, and so does the word length of individual abstracts, which makes calculating a document-by-document cosine similarity difficult.
So what I did was follow the text2vec vignette: I split my example data by topic, parsed and stemmed the text, vectorized the tokens in each abstract (i.e., each row entry), and built the dtm to create the vector space for comparison.
While the methods in the text2vec vignette are straightforward, the outputs come as document-by-document matrices. I am wondering whether there is any way to get a single similarity measure (say, something in [0, 1] or (-1, 1)) between any two sets of documents labelled under two different topics.
My current code is provided below, along with a small 9-row set of news abstracts falling under 3 distinct topics (note that the number of documents per topic and their word lengths all differ: the topic "sports" has two entries, "politics" has four, and "finance" has three). Don't expect meaningful similarity results from such small data; it only serves as an example.
It would be really appreciated if someone could point out how to modify my existing code to get a single pair-wise similarity measure between any two topics.
# load required packages
library(foreign)
library(stringr)
library(text2vec)
news <- read.csv("https://www.dropbox.com/s/rikduji15mr5o89/news.csv?dl=1")
names(news)[1] <- "text"
news$text <- as.character(news$text)
names(news)[2] <- "topic"
news$topic <- as.character(news$topic)
news$topic <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
prep_fun = function(x) {
  x %>%
    # make text lower case
    str_to_lower %>%
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
news$text_clean = prep_fun(news$text)
df <- news[c("topic", "text_clean")]
doc_set_1 <- df[which(df$topic==1), ]
doc_set_2 <- df[which(df$topic==2), ]
doc_set_3 <- df[which(df$topic==3), ]
it1 = itoken(doc_set_1$text_clean, progressbar = FALSE)
it2 = itoken(doc_set_2$text_clean, progressbar = FALSE)
it3 = itoken(doc_set_3$text_clean, progressbar = FALSE)
it = itoken(df$text_clean, progressbar = FALSE)
v = create_vocabulary(it)
# %>% prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)
dtm3 = create_dtm(it3, vectorizer)
# calculate jaccard similarity
d1_d2_jac_sim = sim2(dtm1, dtm2, method = "jaccard", norm = "none")
d2_d3_jac_sim = sim2(dtm2, dtm3, method = "jaccard", norm = "none")
d1_d3_jac_sim = sim2(dtm1, dtm3, method = "jaccard", norm = "none")
# calculate cosine similarity
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
d2_d3_cos_sim = sim2(dtm2, dtm3, method = "cosine", norm = "l2")
d1_d3_cos_sim = sim2(dtm1, dtm3, method = "cosine", norm = "l2")
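# (sketch, not sure it is the right summary: one way to collapse each
#  document-by-document cosine matrix above into a single topic-to-topic
#  number is to average all its entries; the "*_single_cos" names below
#  are just made up for illustration)
d1_d2_single_cos = mean(as.matrix(d1_d2_cos_sim))
d2_d3_single_cos = mean(as.matrix(d2_d3_cos_sim))
d1_d3_single_cos = mean(as.matrix(d1_d3_cos_sim))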
# calculate cosine similarity on the tf-idf-weighted dtm
dtm = create_dtm(it, vectorizer)
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)
# this computes cosine similarity among all 9 documents at once,
# not between any particular pair of topics
all_tfidf_cos_sim = sim2(x = dtm_tfidf, method = "cosine", norm = "l2")
# any way to get tfidf_cos_sim for (d1, d2), (d1, d3), (d2, d3) separately?
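# (sketch of a possible workaround, assuming the rows of dtm_tfidf follow the
#  row order of df so that df$topic can be used to index them; the
#  "dtm_tfidf_*" and "*_tfidf_cos_sim" names are made up for illustration)
dtm_tfidf_1 = dtm_tfidf[df$topic == 1, , drop = FALSE]
dtm_tfidf_2 = dtm_tfidf[df$topic == 2, , drop = FALSE]
dtm_tfidf_3 = dtm_tfidf[df$topic == 3, , drop = FALSE]
d1_d2_tfidf_cos_sim = sim2(dtm_tfidf_1, dtm_tfidf_2, method = "cosine", norm = "l2")
d1_d3_tfidf_cos_sim = sim2(dtm_tfidf_1, dtm_tfidf_3, method = "cosine", norm = "l2")
d2_d3_tfidf_cos_sim = sim2(dtm_tfidf_2, dtm_tfidf_3, method = "cosine", norm = "l2")
# these could then be collapsed to a single number the same way as above,
# e.g. mean(as.matrix(d1_d3_tfidf_cos_sim))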