I am working with a very large corpus of about 85,000 tweets that I'm trying to compare to dialog from television commercials. However, because of the size of the corpus, I can't compute the cosine similarity measure without getting an "Error: cannot allocate vector of size n" message (26 GB in my case).
I am already running 64-bit R on a server with lots of memory. I've also tried the AWS instance with the most memory available (244 GB), but to no avail (same error).
Is there a way to use something like data.table's fread to get around this memory limitation, or do I just have to invent a way to break up my data (a rough sketch of what I mean by that is at the bottom)? Thanks much for the help; I've appended the code below:
library(quanteda)

x <- NULL
y <- NULL
num <- NULL
z <- NULL
ad <- NULL

for (i in 1:nrow(ad.corp$documents)) {
  num <- i
  ad <- paste("ad.num", num, sep = "_")
  # pull out the single ad document for this iteration
  x <- subset(ad.corp, ad.corp$documents$num == num)
  # combine that one ad with the full tweet corpus
  z <- x + corp.all
  z$documents$texts <- as.character(z$documents$texts)
  PolAdsDfm <- dfm(z, ignoredFeatures = stopwords("english"), groups = "num",
                   stem = TRUE, verbose = TRUE, removeTwitter = TRUE)
  PolAdsDfm <- tfidf(PolAdsDfm)
  # cosine similarity between this ad and the (grouped) tweet documents
  y <- similarity(PolAdsDfm, ad, margin = "documents", n = 20,
                  method = "cosine", normalize = TRUE)
  y <- sort(y, decreasing = TRUE)
  if (y[1] > .7) {
    assign(paste(ad, x$documents$texts, sep = "--"), y)
  } else {
    print(paste(ad, "didn't make the cut", sep = "****"))
  }
}
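In case it helps, here's a rough, untested sketch of the kind of "breaking up" I have in mind: split the tweet corpus into blocks and run the same comparison one block at a time, so the dfm/similarity step never sees all 85,000 tweets at once. The block size of 5,000 is just a guess, and corp.all is my tweet corpus from the code above.

library(quanteda)

# split the tweet texts into blocks of roughly 5,000 documents each
tweet.texts <- texts(corp.all)
blocks <- split(tweet.texts, ceiling(seq_along(tweet.texts) / 5000))

for (j in seq_along(blocks)) {
  corp.block <- corpus(blocks[[j]])   # a corpus holding one block of tweets
  # ...then run the same dfm() / tfidf() / similarity() steps as above,
  # with corp.block in place of corp.all, collecting results per block
}

The downside I can see is that the tf-idf weights would then be computed within each block rather than over the whole corpus, and I'm not sure how much that would distort the similarity scores.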