
While learning how to handle large data sets (more than 1 or 2 GB) in R, I am trying to use the ff package and the ffdfdply function. (See this link on how to use ffdfdply: R language: problems computing "group by" or split with ff package.)

My data have the following columns:
"id" "birth_date" "diagnose" "date_diagnose"

There are several rows for each "id", and I want to extract the first date at which there was a diagnosis.

I would apply this:

library(ffbase)
library(plyr)
load(file = file_name)  # load my ffdf database, called data.f

my_fun <- function(x){
  ddply(x, .(id), summarize,
        age = min(date_diagnose - birth_date, na.rm = TRUE))
}

result <- ffdfdply(x = data.f, split = data.f$id,
                   FUN = function(x) my_fun(x), trace = TRUE)
result[1:10, ]  # to check

It is very strange, but this call, ffdfdply(x = data.f, ...), makes RStudio (and R) crash. Sometimes the same command crashes R and sometimes it does not. For example, if I run the ffdfdply line again (after it worked the first time), R crashes.

Using other functions, data, etc. has the same effect. There is no memory increase, and nothing appears in log.txt. The same behaviour occurs when applying the summaryBy "technique" (sketched below).
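
For reference, the summaryBy variant I mean looks roughly like this (just a sketch with the doBy package and the column names above, not my exact code):

library(doBy)

my_fun_doby <- function(x){
  x$age <- x$date_diagnose - x$birth_date            # age at each diagnosis
  summaryBy(age ~ id, data = x,                      # earliest age per id
            FUN = function(v) min(v, na.rm = TRUE))
}

result_doby <- ffdfdply(x = data.f, split = data.f$id,
                        FUN = my_fun_doby, trace = TRUE)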

If anybody has had the same problem and found a solution, that would be very helpful. Also, ffdfdply gets very slow (slower than SAS...), and I am thinking about another strategy for this kind of task.

Does ffdfdply take into account that, for example, the data set is ordered by id (so that it does not have to look through all the data to find the rows with the same id)?

So, if anybody knows other approaches to this ddply-style problem, it would be really helpful for all the "large data sets in R with little RAM" users.

This is my sessionInfo():

R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252   
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C                   
[5] LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] plyr_1.7.1   ffbase_0.6-1 ff_2.2-10    bit_1.1-9
asked by Miguel Vazq
  • ffdfdply does not take into account that your data are sorted by id. But to avoid doing ffwhich too often when you have a lot of ids, it tries to work out which ids' data can be put into RAM in blocks, hence reducing the number of ffwhich calls. –  Nov 19 '12 at 15:41

1 Answer


I also noticed this when using the version of the package we uploaded to CRAN recently. It seems to be caused by ffbase overloading the "[.ff" and "[<-.ff" extractor and setter functions from package ff.

I will remove this feature from the package and upload it to CRAN soon. In the meantime, you can use version 0.7 of ffbase, which you can get here: http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz

and install it as:

download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
shell("R CMD INSTALL ffbase_0.7.tar.gz")

Let me know if that helped.

  • Hi @jwijffels, thank you for the quick response. Yes, I have installed the package as you suggested, and now ffdfdply is working but subset() is not... it makes R crash... Also, about the second point (speed), do you think there is a way to increase speed if the data are ordered by "id", for example? – Miguel Vazq Nov 19 '12 at 15:19
  • Hi Miguel, good to know that you are encountering the same issues I ran into last week. I also noticed the issue with subset.ffdf since version 0.6-1. It is because the implementation was changed after version 0.5. I will notify the other package author, as he wrote it, so that we can upload a new version to CRAN. –  Nov 19 '12 at 15:29
  • About the speed issue. As I indicated in http://stackoverflow.com/questions/13398061/r-language-problems-computing-group-by-or-split-with-ff-package, don't pass in the whole ffdf as you do (ffdfdply(x = data.f, ...)) but pass in only the data you need (ffdfdply(x = data.f[c("id","date_diagnose","birth_date")], ...)), as otherwise you pull all your data from the ff files into RAM while your function only uses 3 columns. Also note that ddply is not the fastest function: doBy is faster, or use data.table inside FUN (a sketch combining these suggestions follows this thread). Hint: if you have several CPUs, you can use the parallel package. –  Nov 19 '12 at 15:32
  • I am aware of taking only the columns I need, but as it was crashing I was doing the simplest case to detect the cause. It seems to me that once the data are ordered by id, for example, making those summaries gets much easier, because it is a matter of reading only once from 1 to nrow, and when the id changes, computing the next summary... Couldn't we do that via ff or other big-data packages? Any hint? Thank you anyway for the other suggestions. My goal is to have these kinds of (very common) tasks running faster than SAS or SQL without memory errors. – Miguel Vazq Nov 19 '12 at 21:23
  • Hi there, thanks for the feedback. Just a question: have you already compared SAS speed to SQL speed to ff speed for your problem, as in https://stat.ethz.ch/pipermail/r-packages/2010/001178.html? Can you indicate why first ordering the data would be speedier than doing several ffwhich operations? –  Nov 20 '12 at 16:46
  • Hi, yes, it is a good idea to do this comparison. I will post the results as soon as I have them. The reason that ordering the data would speed things up is that the program would only have to read the indexes ONCE, from 1 to nrow, in order, continuously, detect when there is a change in "id" (in this example), store the rows with the same "id" and calculate the function. A similar procedure is proposed here: http://r.789695.n4.nabble.com/Reading-big-files-in-chunks-ff-package-td4502070.html (in the 4th post) (this seems slow to me). Using ffwhich for each "id" looks through all the data each time, doesn't it? – Miguel Vazq Nov 23 '12 at 07:29
  • If you have the comparison to SQL/SAS, do post the speed somewhere. Currently the data are scanned over the id once to build a table (to see how often each id occurs), and then an ffwhich (over the whole table) is done as many times as the number of blocks of ids that can be put into RAM. –  Nov 23 '12 at 08:42
  • FYI. The crash about subset.ffdf (which happened if you did not supply the select argument) is now solved in version 0.6-2 of package ffbase. See the news file: http://cran.r-project.org/web/packages/ffbase/NEWS –  Nov 27 '12 at 20:17
  • Hi, thank you for that update. About the speed, comparisons are almost impossible with my data/computer, because the run time tends to infinity with R using any of the "techniques" we talked about (SAS will do it in 5 minutes, and R will take more than 1 hour). I also tested similar cases with smaller databases and got the same results: SAS/SQL are much better at "doing by id" when there are a lot of different "id" values and only a few rows for each "id" (so the groups of identical "id" are small and there are many different groups). Any suggestions? – Miguel Vazq Nov 29 '12 at 09:17
  • Yes, profile your code to see where the computational burden is. It is probably because you used ddply inside FUN, which is not the fastest function in the world. –  Nov 29 '12 at 09:41
  • Hi jwijffels, and thank you for the answers. Yes, it seems that ddply is extremely slow, really. data.table makes that process 100x faster. But while using ffdfdply my R session crashes... oops. This is my code: age <- function(x){ xt <- data.table(x); xt[ , list(min(date, na.rm=TRUE), max(date, na.rm=TRUE)), by=list(id)] }; result <- ffdfdply(x = data[c("id", "date")], split = data$id, FUN = function(x) age(x), trace=TRUE). Anyway, I am getting closer via the ffdfdply and data.table combo, once ffdfdply no longer crashes... – Miguel Vazq Nov 29 '12 at 14:03
  • data.table is indeed a better option. But please make sure you have installed the latest version of ffbase from CRAN, and make sure your FUN returns a data.frame as documented in ?ffdfdply, not a data.table. This will solve your speed and 'crash' issues. –  Nov 29 '12 at 14:54
  • Hi jwijffels, I still get those crashes, not only when working with ffdfdply but also with some other [,] subsettings. I am using the latest versions of ff and ffbase... – Miguel Vazq Dec 19 '12 at 13:41
  • Hi there, I'm almost certain that you did not install the latest version of ffbase that is on CRAN. If that does not help, you can mail me directly instead of continuing to comment on Stack Overflow. –  Dec 19 '12 at 15:10
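
A minimal sketch combining the suggestions from this thread: pass only the needed columns of the ffdf into ffdfdply, use data.table inside FUN, and convert the result back to a data.frame as ?ffdfdply requires. The object and column names (data.f, id, birth_date, date_diagnose) are taken from the question; treat this as an illustration, not the poster's final code.

library(ffbase)
library(data.table)

first_diagnosis_age <- function(x){
  xt <- data.table(x)
  # earliest age at diagnosis per id
  res <- xt[ , list(age = min(date_diagnose - birth_date, na.rm = TRUE)), by = id]
  as.data.frame(res)   # ffdfdply expects FUN to return a data.frame
}

result <- ffdfdply(x = data.f[c("id", "date_diagnose", "birth_date")],
                   split = data.f$id,
                   FUN = first_diagnosis_age, trace = TRUE)
result[1:10, ]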