2

I have a database that looks like this:

userId          SessionId        Screen         Platform       Version
01              1                first          IOS            1.0.1
01              1                main           IOS            1.0.1
01              2                first          IOS            1.0.1
01              3                first          IOS            1.0.1
01              3                main           IOS            1.0.1
01              3                detail         IOS            1.0.1
02              1                first          Android        1.0.2

Basically what I intend to do is to determine if a "path" (different screens) leads to a better retention or not. I would like to re-organize each sessionId in one column. Ideal database would look like this:

userId       SessionId       Path                 Retention
01           1               first;main           3
01           2               first                3
01           3               first;main;detail    3
02           1               first                1

Here is the variable Retention would be equal to the maximum SessionId.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Alban Couturier
  • 129
  • 2
  • 8

2 Answers2

1

A possible solution in base R:

d2 <- aggregate(Screen ~ userId + SessionId, d, toString)
transform(d2, retention = ave(Screen, userId, FUN = length))

which gives:

> d2
  userId SessionId              Screen retention
1     01         1         first, main         3
2     02         1               first         1
3     01         2               first         3
4     01         3 first, main, detail         3

An alternative using dplyr:

library(dplyr)
d %>% 
  group_by(userId, SessionId) %>% 
  summarise(Screen = toString(Screen)) %>% 
  group_by(userId) %>% 
  mutate(retention = n())

which gives:

  userId SessionId              Screen retention
   <chr>     <int>               <chr>     <int>
1     01         1         first, main         3
2     01         2               first         3
3     01         3 first, main, detail         3
4     02         1               first         1
Jaap
  • 81,064
  • 34
  • 182
  • 193
0

I have a data.table solution

library(data.table)
dt <- as.data.table(d)
dt[, Retention := max(SessionId), by = .(userId)]
dt[, .(Screen = paste(Screen, collapse = ";"), Retention = unique(Retention)), by = .(userId, SessionId)]

userId SessionId            Screen Retention
1:     01         1        first;main         3
2:     01         2             first         3
3:     01         3 first;main;detail         3
4:     02         1             first         1
sebastian-c
  • 15,057
  • 3
  • 47
  • 93
  • A cleaner alternative (at least imo): `dt[, .(Screen = toString(Screen)), by = .(userId, SessionId)][, retention := .N, by = userId][]` – Jaap Jul 15 '16 at 16:18
  • I assumed that the semicolons were important and that there may be sessionIds missing. – sebastian-c Jul 15 '16 at 16:29