I am doing some data manipulation with dplyr on my huge data frame, b.
I have been able to work successfully on smaller subsets of the data, so I guess my problem is with its size: the full data frame has 4 million rows and 34 columns.
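In case it helps, a small synthetic sample with the columns my code actually touches looks like this (the values are invented, and the real frame has 34 columns; Kapanma.tarihi.... is the closing-date column as R named it on import):

```r
# Tiny invented sample of b, restricted to the columns used below.
b <- data.frame(
  Id = c(1, 1, 2),
  CreatedDate = c("2020-01-01", "2020-03-01", "2020-02-15"),
  Kapanma.tarihi.... = c("2020-04-01", "2020-04-01", "2020-03-01"),
  Lead_DataSource__c = c("Web", "Email", "Phone"),
  StageName = c("Closed Won", "Closed Won", "Closed Lost")
)
```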
My code is as follows:
df <- b %>%
  group_by(Id) %>%
  mutate(numberoflead = n(),                                  # number of leads
         lastcreateddateoflead = max(CreatedDate),            # last created date of lead
         firstcreateddateoflead = min(CreatedDate),           # first created date of lead
         lastcloseddate = max(Kapanma.tarihi....),            # last closed date
         yas = as.Date(lastcloseddate) - as.Date(firstcreateddateoflead),                      # age
         leadduration = as.Date(lastcreateddateoflead) - as.Date(firstcreateddateoflead)) %>%  # lead duration
  inner_join(b %>%
               select(Id, CreatedDate, lasttouch = Lead_DataSource__c),
             by = c("Id" = "Id", "lastcreateddateoflead" = "CreatedDate")) %>%   # last touch
  inner_join(b %>%
               select(Id, CreatedDate, firsttouch = Lead_DataSource__c),
             by = c("Id" = "Id", "firstcreateddateoflead" = "CreatedDate")) %>%  # first touch
  inner_join(b %>%
               select(Id, Kapanma.tarihi...., laststagestatus = StageName),
             by = c("Id" = "Id", "lastcloseddate" = "Kapanma.tarihi...."))       # last stage status
This worked well on a smaller subset of my data frame, but when I run the code above on the full data frame, it runs for a very long time and eventually crashes. I suspect the problem is the 4 million rows.
Does anyone have any suggestions on how to do this more efficiently? Thanks a lot for the help!