I'm using R 2.15 on Ubuntu.
I am applying a function that assigns keyword IDs to streaming text data from a popular social networking site. To make the process more efficient, I split the data into two parts and apply the function to each part in parallel:
# sample of the streaming text, with a keywordID column to fill in
textd <- data.frame(text = c("they", "dont", "think", "it", "be",
                             "like it is", "but it do"),
                    keywordID = 0)
# split the rows into two roughly equal chunks, one per core
textd <- split(textd, seq(nrow(textd)) %/% 2 %% 2 == 0)
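To confirm the split produced two chunks:

sapply(textd, nrow)  # two chunks, 4 and 3 rows here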
# lookup table mapping each keyword to its ID
keywords <- data.frame(kwds = c("be", "do", "is"), keywordID = 1:3)
library(doParallel)
library(foreach)
registerDoParallel(2)  # register a two-worker parallel backend
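As a check that the backend actually registered:

getDoParWorkers()  # should print 2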
textd <- foreach(j = 1:2) %dopar% {
  t <- textd[[j]]
  for (i in keywords$kwds) {  # loop to assign keyword IDs
    tmp <- grepl(i, t$text, ignore.case = TRUE)
    # rows whose keyword field is already populated get duplicated with
    # the new ID; this is checked before the assignment below so rows
    # filled in this same iteration aren't duplicated
    cond2 <- tmp & t$keywordID != 0
    cond  <- tmp & t$keywordID == 0
    if (any(cond)) {
      t$keywordID[cond] <- keywords$keywordID[keywords$kwds == i]
    }
    if (any(cond2)) {
      extra <- t[cond2, ]
      extra$keywordID <- keywords$keywordID[keywords$kwds == i]
      t <- rbind(t, extra)
    }
  }
  t
}
library(data.table)
textd <- as.data.frame(rbindlist(textd))  # recombine the chunks
The problem is that each worker ends up using as much RAM as the whole job would on a single core, so total memory use doubles and I run out quickly. What am I doing wrong? How do I get the data (and the RAM) to split between the cores?
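My guess is that referencing textd inside the %dopar% body makes foreach export the entire list to every worker. Would iterating over the chunks directly avoid the copy? A minimal sketch of what I mean (untested assumption on my part):

# pass each chunk through the iteration variable, so each worker
# should only receive its own piece rather than the whole list
textd <- foreach(t = textd) %dopar% {
  # ... same keyword-matching loop as above, operating on t ...
  t
}

Thanks for looking.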