I am working with a very large corpus of about 85,000 tweets that I'm trying to compare to dialog from television commercials. However, because of the size of the corpus, I can't compute the cosine similarity measure without getting an "Error: cannot allocate vector of size n" message (26 GB in my case).
I am already running 64-bit R on a server with lots of memory. I've also tried the AWS instance with the most memory available (244 GB), but to no avail (same error).
Is there a way to use something like data.table's fread to get around this memory limitation, or do I just have to invent a way to break up my data (a rough sketch of what I mean by that is at the bottom)? Thanks much for the help; I've appended the code below:
library(quanteda)

x <- NULL
y <- NULL
num <- NULL
z <- NULL
ad <- NULL

for (i in 1:nrow(ad.corp$documents)) {
  num <- i
  ad <- paste("ad.num", num, sep = "_")
  # pull out the single ad document for this iteration
  x <- subset(ad.corp, ad.corp$documents$num == num)
  # combine that one ad with the full tweet corpus
  z <- x + corp.all
  z$documents$texts <- as.character(z$documents$texts)
  PolAdsDfm <- dfm(z, ignoredFeatures = stopwords("english"), groups = "num",
                   stem = TRUE, verbose = TRUE, removeTwitter = TRUE)
  PolAdsDfm <- tfidf(PolAdsDfm)
  # cosine similarity between this ad and the (grouped) tweet documents
  y <- similarity(PolAdsDfm, ad, margin = "documents", n = 20,
                  method = "cosine", normalize = TRUE)
  y <- sort(y, decreasing = TRUE)
  if (y[1] > .7) {
    assign(paste(ad, x$documents$texts, sep = "--"), y)
  } else {
    print(paste(ad, "didn't make the cut", sep = "****"))
  }
}
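In case it helps, here's a rough, untested sketch of the kind of "breaking up" I have in mind: split the tweet corpus into blocks and run the same comparison one block at a time, so the dfm/similarity step never sees all 85,000 tweets at once. The block size of 5,000 is just a guess, and corp.all is my tweet corpus from the code above.

library(quanteda)

# split the tweet texts into blocks of roughly 5,000 documents each
tweet.texts <- texts(corp.all)
blocks <- split(tweet.texts, ceiling(seq_along(tweet.texts) / 5000))

for (j in seq_along(blocks)) {
  corp.block <- corpus(blocks[[j]])   # a corpus holding one block of tweets
  # ...then run the same dfm() / tfidf() / similarity() steps as above,
  # with corp.block in place of corp.all, collecting results per block
}

The downside I can see is that the tf-idf weights would then be computed within each block rather than over the whole corpus, and I'm not sure how much that would distort the similarity scores.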