
I'm doing some data manipulation with dplyr on my huge data frame (`b`). I have been able to work successfully on smaller subsets of my data, so I guess my problem is the size of the data frame.

I have a data frame with 4 million rows and 34 columns.

My code is as follows:

df <- b %>%
  group_by(Id) %>%
  mutate(numberoflead = n(), # number of leads
         lastcreateddateoflead = max(CreatedDate), # last created date of lead
         firstcreateddateoflead = min(CreatedDate), # first created date of lead
         lastcloseddate = max(Kapanma.tarihi....), # last closed date
         yas = as.Date(lastcloseddate) - as.Date(firstcreateddateoflead), # age (yas)
         leadduration = as.Date(lastcreateddateoflead) - as.Date(firstcreateddateoflead)) %>% # lead duration
  inner_join(b %>% 
               select(Id, CreatedDate, lasttouch = Lead_DataSource__c),
             by = c("Id" = "Id", "lastcreateddateoflead" = "CreatedDate")) %>% #lasttouch
  inner_join(b %>% 
               select(Id, CreatedDate, firsttouch = Lead_DataSource__c),
             by = c("Id" = "Id", "firstcreateddateoflead" = "CreatedDate")) %>%  #firsttouch
  inner_join(b %>% 
               select(Id, Kapanma.tarihi...., laststagestatus = StageName),#laststagestatus
             by = c("Id" = "Id", "lastcloseddate" = "Kapanma.tarihi...."))

It has worked well on a smaller subset of my data frame, but when I run the code above on my full data frame, it runs for a very long time and eventually crashes. I think the problem may be the 4 million rows of my data frame.

Anyone have any suggestions on how to do this? Thanks a lot for help!

Ozgur Alptekın
  • Try with `data.table` i.e. `setDT(b)[, c('numberoflead', 'lastcreateddateoflead') := .(.N, max(CreatedDate)), Id]` – akrun Sep 07 '20 at 23:59
  • Also check out `dtplyr` (data.table backend to dplyr) and `dbplyr` (SQL database backend to dplyr) – Ben Bolker Sep 08 '20 at 00:28
  • @BenBolker, I have tried with dtplyr but now I got this error: Error: cannot allocate vector of size 17.5 Mb. Any idea about it? – Ozgur Alptekın Sep 08 '20 at 15:09
  • That means you're still running into memory limitations. How much RAM do you have? You may need an out-of-memory solution (e.g. `dbplyr`, or see https://cran.r-project.org/web/views/HighPerformanceComputing.html) – Ben Bolker Sep 08 '20 at 16:46
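Expanding on the `data.table` route from the comments above, here is a rough sketch of what the whole transformation could look like with by-reference updates and update joins instead of copies and `inner_join`. It is untested and assumes the column names shown in the question; also note that an update join keeps all rows of `b`, whereas `inner_join` would drop rows without a match.

library(data.table)

setDT(b)  # convert the data frame to a data.table in place (no copy)

# per-Id summary columns, added by reference
b[, `:=`(numberoflead           = .N,
         lastcreateddateoflead  = max(CreatedDate),
         firstcreateddateoflead = min(CreatedDate),
         lastcloseddate         = max(Kapanma.tarihi....)),
  by = Id]

# row-wise derived columns (no grouping needed)
b[, `:=`(yas          = as.Date(lastcloseddate) - as.Date(firstcreateddateoflead),
         leadduration = as.Date(lastcreateddateoflead) - as.Date(firstcreateddateoflead))]

# update joins: look up the data source at the last/first created date
b[b[, .(Id, CreatedDate, Lead_DataSource__c)],
  on = .(Id, lastcreateddateoflead = CreatedDate),
  lasttouch := i.Lead_DataSource__c]
b[b[, .(Id, CreatedDate, Lead_DataSource__c)],
  on = .(Id, firstcreateddateoflead = CreatedDate),
  firsttouch := i.Lead_DataSource__c]

# ... and the stage name at the last closed date
b[b[, .(Id, Kapanma.tarihi...., StageName)],
  on = .(Id, lastcloseddate = Kapanma.tarihi....),
  laststagestatus := i.StageName]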

1 Answer


I have been facing a similar issue recently with data of about the same size. I think your issue is the size of the R memory space; you can check how much is in use above the global environment pane in your R editor. My memory was being overloaded by the large data, and the program crashed regularly.
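If you would rather check this from code than from the editor pane, base R's `object.size()` reports how much memory a single object occupies; a quick sketch, using the question's data frame `b`:

# approximate in-memory size of the data frame, printed in megabytes
format(object.size(b), units = "Mb")

# run garbage collection and report current memory use of the R session
gc()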

My solution was to write two separate scripts. In the first, I merged all my datasets into one data frame and ended the script with

saveRDS(file, file = "filename.Rds") # save the merged object as a single .Rds file in the working directory

Then I closed that script, cleared the R memory manually (click "free unused R memory"), and started a new script in which I loaded the previously created file; a scripted alternative to the manual clean-up is sketched after the code below.

setwd("PathWhereTheFileIsSaved") # set working directory
complete <- readRDS(file = "filename.Rds") # load the data saved by the first script
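If you prefer to clear the memory from code instead of clicking in RStudio, something like this at the end of the first script should do the same thing:

# after saveRDS() at the end of the first script, release the session's memory
rm(list = ls())  # remove every object from the global environment
gc()             # run garbage collection to reclaim the memory those objects used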

Afterward, my code was working without overloading the memory.

Mels