I'm using R 2.15 on Ubuntu.
I am applying a function that assigns keyword IDs to streaming text data from a popular social networking site. To make the process more efficient, I split the data into two parts and apply the function to each part in parallel:
# sample of the streaming text, with a keywordID column to fill in
textd <- data.frame(text = c("they", "dont", "think", "it", "be",
                             "like it is", "but it do"),
                    keywordID = 0)
# split the rows into two roughly equal chunks, one per core
textd <- split(textd, seq(nrow(textd)) %/% 2 %% 2 == 0)
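To confirm the split produced two chunks:

sapply(textd, nrow)  # two chunks, 4 and 3 rows here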
# lookup table mapping each keyword to its ID
keywords <- data.frame(kwds = c("be", "do", "is"), keywordID = 1:3)
library(doParallel)
library(foreach)
registerDoParallel(2)  # register a two-worker parallel backend
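As a check that the backend actually registered:

getDoParWorkers()  # should print 2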
textd <- foreach(j = 1:2) %dopar% {
  t <- textd[[j]]
  for (i in keywords$kwds) {  # loop to assign keyword IDs
    tmp <- grepl(i, t$text, ignore.case = TRUE)
    # rows whose keyword field is already populated get duplicated with
    # the new ID; this is checked before the assignment below so rows
    # filled in this same iteration aren't duplicated
    cond2 <- tmp & t$keywordID != 0
    cond  <- tmp & t$keywordID == 0
    if (any(cond)) {
      t$keywordID[cond] <- keywords$keywordID[keywords$kwds == i]
    }
    if (any(cond2)) {
      extra <- t[cond2, ]
      extra$keywordID <- keywords$keywordID[keywords$kwds == i]
      t <- rbind(t, extra)
    }
  }
  t
}
library(data.table)
textd <- as.data.frame(rbindlist(textd))  # recombine the chunks
The problem is that each worker ends up using as much RAM as the whole job would on a single core, so total memory use doubles and I run out quickly. What am I doing wrong? How do I get the data (and the RAM) to split between the cores?
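My guess is that referencing textd inside the %dopar% body makes foreach export the entire list to every worker. Would iterating over the chunks directly avoid the copy? A minimal sketch of what I mean (untested assumption on my part):

# pass each chunk through the iteration variable, so each worker
# should only receive its own piece rather than the whole list
textd <- foreach(t = textd) %dopar% {
  # ... same keyword-matching loop as above, operating on t ...
  t
}

Thanks for looking.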