2

I need to do a fast aggregation by id_client of dates: min, max, difference of dates in months and quantity of months.

Example table:

tbl<-data.frame(id_cliente=c(1,1,1,1,2,3,3,3), 
fecha=c('2013-01-01', '2013-06-01','2013-05-01', '2013-04-01', '2013-01-01', '2013-01-01','2013-05-01','2013-04-01'))

Format dates:

tbl$fecha<-as.Date(as.character(tbl$fecha))

My first approach was ddply:

tbl2<-ddply(tbl, .(id_cliente), summarize, cant=length(id_cliente), 
max=max(fecha), min=min(fecha),
dif=length(seq(from=min, to=max, by='month')))

I got the desired result, but with my real table takes too much time. So I tried tapply:

tbl3<-data.frame(cbind(dif=tapply(tbl$fecha, list(tbl$id_cliente), secuencia),
        hay=tapply(tbl$fecha, list(tbl$id_cliente), length),
        min=tapply(tbl$fecha, list(tbl$id_cliente), min),
        max=tapply(tbl$fecha, list(tbl$id_cliente), max)
        ))

The result was:

> tbl3
  dif hay   min   max
   6   4 15706 15857
   1   1 15706 15706
   5   3 15706 15826

In this case I got instead of dates, numbers. So since the following works, I tried using as.Date inside tapply:

as.Date(15706, origin='1970-01-01')

MIN<-function(x){as.Date(min(x), origin='1970-01-01')}

The function works but with tapply doesn't.

tbl3<-data.frame(cbind(min=tapply(tbl$fecha, list(tbl$id_cliente), MIN)))

And I still got the number instead of date. How can I solve this? Thanks.

GabyLP
  • 3,649
  • 7
  • 45
  • 66

2 Answers2

2

I know this is a bit late, but I figured I would put this here for the people still googling this issue.

Interestingly, tapply returns the correct results when you keep the date column in text format and then you can convert to a date after:

tbl<-data.frame(id_cliente=c(1,1,1,1,2,3,3,3), 
                fecha=c('2013-01-01', '2013-06-01','2013-05-01', '2013-04-01', '2013-01-01', '2013-01-01','2013-05-01','2013-04-01'))
tbl3<-data.frame(cbind(dif=tapply(tbl$fecha, list(tbl$id_cliente), seq),
                        hay=tapply(tbl$fecha, list(tbl$id_cliente), length),
                        min=tapply(tbl$fecha, list(tbl$id_cliente), min),
                        max=tapply(tbl$fecha, list(tbl$id_cliente), max))) 
head(tbl3)
#         dif hay        min        max
# 1, 2, 3, 4   4 2013-01-01 2013-06-01
#          1   1 2013-01-01 2013-01-01
#    1, 2, 3   3 2013-01-01 2013-05-01
B. Stanley
  • 234
  • 2
  • 7
0

With base R, the ?Date class is converted to the number of days from Jan. 1, 1970. Try using dplyr or data.table to preserve the date class:

dplyr

library(dplyr)
tbl %>% group_by(id_cliente) %>%
        summarise(dif=length(seq(min(fecha), max(fecha), by='month')),
                  hay=length(fecha),
                  min=min(fecha),
                  max=max(fecha))
# Source: local data frame [3 x 5]
# 
#   id_cliente dif hay        min        max
# 1          1   6   4 2013-01-01 2013-06-01
# 2          2   1   1 2013-01-01 2013-01-01
# 3          3   5   3 2013-01-01 2013-05-01

data.table

library(data.table)
setDT(tbl)[,.(dif=length(seq(min(fecha), max(fecha), by='month')),
              hay= .N,
              min=min(fecha),
              max=max(fecha)), by=id_cliente]
#    id_cliente dif hay        min        max
# 1:          1   6   4 2013-01-01 2013-06-01
# 2:          2   1   1 2013-01-01 2013-01-01
# 3:          3   5   3 2013-01-01 2013-05-01
Pierre L
  • 28,203
  • 6
  • 47
  • 69