1

I have the following dataset

client_id <- c("A", "A", "B", "B", "B", "B", "B", "A", "A", "B", "B")
value <- c(10, 35, 20, 30, 50, 40, 30, 40, 30, 40, 10)
period_30 <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
period_60 <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
sign <- c("D", "D", "D", "D", "C", "C", "C", "D", "D", "D", "D")

data <- data.frame(client_id, value, period_30, period_60, sign)

I can use this code to count the number of different splits per given period with the code below:

library(data.table)
test<- dcast(setDT(data), client_id ~ paste0("period_30", sign), value.var = "period_30", sum)

But I would like to also calculate the value as per the different splits.

The expected outcome would look like this:

client_id       av.value_period_30_sign_D   av.value_period_60_sign_D   av.value_period_30_sign_C   av.value_period_30_sign_D
    A                     34.16667                      NaN                  NaN                                   NaN
    B                     30.00000                    34.16667               NaN                               27.50000

And then, it should be extendable to additional splits, like average value of sign X, of type X in period 1.

I am not sure if the desired output is doable with this approach. But I was looking at the fun.aggregate argument. Perhaps it could be used in combination with multiple value.var arguments?

Update: Joel's code answers the first part of the question.

client_id   sign    period_30   period_60 
    A         D     34.16667    34.16667
    B         D     30.00000    34.16667
    B         C     NaN         27.50000

But how do I transpose the variables and assign the names as per the splits automatically?

Sotos
  • 51,121
  • 6
  • 32
  • 66
Prometheus
  • 1,977
  • 3
  • 30
  • 57

2 Answers2

1

another method(would be faster) is using data.table

Based on the edit made to the question :(hope the code is self explanatory now)

library(data.table)
data1 <- setDT(data)[, lapply(.SD, function(x) mean(value[x==1])),
                      .SDcols = period_30:period_60,
                      by = .(client_id, sign)]
# `dcast` if also from `data.table` package
dcast(data1, client_id~sign, drop = FALSE, value.var = c("period_30", "period_60"))
#   client_id period_30_C period_30_D period_60_C period_60_D
#1:         A          NA    34.16667          NA    34.16667
#2:         B         NaN    30.00000        27.5    34.16667
joel.wilson
  • 8,243
  • 5
  • 28
  • 48
  • 1
    Hello Joel, I just updates the question. Indeed, your code answers the first part of the problem. Do you have any idea how I can solve the transposing/naming part? It would be extremely helpful for me. – Prometheus Feb 16 '17 at 12:46
  • Thank you Joel! Its a life saver :) – Prometheus Feb 17 '17 at 16:02
  • Hi Joel, I have been testing the code with different datasets, and I realized that the grouping fails. So if there is more than one user or sign, the result is incorrect. Do you have any idea what might be going wrong? It works fine for one user tho. – Prometheus Feb 20 '17 at 14:01
  • Perhaps something similar to this could be a solution: check <- setDT(data)[, lapply(.SD, function(x) {ifelse(x == 1, NA, sum(data$value))}), .SDcols = PERIOD_30:PERIOD_365, by = (client_id)] – Prometheus Feb 20 '17 at 14:21
0

One could use dplyr; Given the current df (=test):

df %>% group_by(sign) %>% summarize(avg.val=mean(value),avg.period1=mean(period_1),avg.period2=mean(period_2),avg.period3=mean(period_3))

which gives:

# A tibble: 2 × 5
    sign avg.val avg.period1 avg.period2 avg.period3
   <chr>   <dbl>       <dbl>       <dbl>       <dbl>
1 Credit   39.50        0.50         1.0           1
2  Debit   36.25        0.25         0.5           1

You could change the grouping variable in group to meet your needs.

thisisrg
  • 596
  • 3
  • 12
  • I need to group by multiple splits - customer, sign, type. And second, with this approach you have to manually label each variable. So if I have 10 periods, 10 signs, 10 types, it becomes impossibe to implement :/ Thats why I cant to rely on dcast. – Prometheus Feb 15 '17 at 18:07
  • If I understood correctly, for the second part you could use summarise_each. Also you could group by multiple variables as well. I was just working on the df you provided. – thisisrg Feb 15 '17 at 18:17