3

I have long table with 97M rows. Each row contains the information of an action taken by a person and the timestamp for that action, in the form:

actions <- c("walk","sleep", "run","eat")
people <- c("John","Paul","Ringo","George")
timespan <- seq(1000,2000,1)

set.seed(28100)
df.in <- data.frame(who = sample(people, 10, replace=TRUE),
                    what = sample(actions, 10, replace=TRUE),
                    when = sample(timespan, 10, replace=TRUE))

df.in
#       who  what when
# 1    Paul   eat 1834
# 2    Paul sleep 1295
# 3    Paul   eat 1312
# 4   Ringo   eat 1635
# 5    John sleep 1424
# 6  George   run 1092
# 7    Paul  walk 1849
# 8    John   run 1854
# 9  George sleep 1036
# 10  Ringo  walk 1823

Each action can be taken or not taken by a person and actions can be taken in whatever order.

I am interested in summarising the sequence of action in my dataset. In particular for each person I want to find which action was taken first, second, third and fourth. In the event that an action is taken multiple times I am only interested in the first occurrence. Then if someone runs, eats, eats, runs and sleeps I am interested in summarise such as run, eat, sleep.

df.out <- data.frame(who = factor(character(), levels=people),
                     action1 = factor(character(), levels=actions),
                     action2 = factor(character(), levels=actions),
                     action3 = factor(character(), levels=actions),
                     action4 = factor(character(), levels=actions))

I can obtain what I want with a forloop:

for (person in people) {
  tmp <- subset(df.in, who==person)
  tmp <- tmp[order(tmp$when),]
  chrono_list <- unique(tmp$what)
  df.out <- rbind(df.out, data.frame(who = person,
                                     action1 = chrono_list[1],
                                     action2 = chrono_list[2],
                                     action3 = chrono_list[3],
                                     action4 = chrono_list[4]))
}

df.out
#        who action1 action2 action3 action4
#   1   John   sleep     run    <NA>    <NA>
#   2   Paul   sleep     eat    walk    <NA>
#   3  Ringo     eat    walk    <NA>    <NA>
#   4 George   sleep     run    <NA>    <NA>

Can this result be obtained also without a loop in a more efficient fashion?

Kevin M
  • 481
  • 6
  • 20
CptNemo
  • 6,455
  • 16
  • 58
  • 107

4 Answers4

5

We could use dcast from the devel version of data.table, ie. v1.9.5. We can install it from here

library(data.table)#v1.9.5+
dcast(setDT(df.in)[order(when),action:= paste0('action', 1:.N) ,who],
                           who~action, value.var='what')

If you need unique 'what' for each 'who'

dcast(setDT(df.in)[, .SD[!duplicated(what)], who][order(when),
    action:= paste0('action', 1:.N), who], who~action, value.var='what')
#         who action1 action2 action3
#1: George   sleep     run      NA
#2:   John   sleep     run      NA
#3:   Paul   sleep     eat    walk
#4:  Ringo     eat    walk      NA

Or using .I will be a bit more fast

 ind <- setDT(df.in)[,.I[!duplicated(what)], who]$V1 

 dcast(df.in[ind][order(when),action:= paste0('action', 1:.N) ,who], 
            who~action, value.var='what')

Or using setorder and unique which may be a memory efficient as setorder reorder the dataset by reference.

 dcast(unique(setorder(setDT(df.in), who, when), by=c('who', 'what'))[,
     action:= paste0('action', 1:.N), who], who~action, value.var='what')
 #     who action1 action2 action3
 #1: George   sleep     run      NA
 #2:   John   sleep     run      NA
 #3:   Paul   sleep     eat    walk
 #4:  Ringo     eat    walk      NA
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Note that at the time of posting `data.table` is at 1.9.4 in CRAN. So I up-voted you because I like the use of `data.table` especially for the very large dataset the OP has, but he should know it requires a bit of extra work to use 1.9.5+. You might add the code/steps necessary to use `devtools` to [install `data.table` from GitHub](https://github.com/Rdatatable/data.table/wiki/Installation). – Forrest R. Stevens May 13 '15 at 16:29
  • @ForrestR.Stevens Thanks, I will update with the instructions to install – akrun May 13 '15 at 16:30
  • It seems to work also with 1.9.4. Is it better to install 1.9.5 anyhow? – CptNemo May 13 '15 at 16:35
  • @CptNemo I thought we need to use `dcast.data.table` instead of `dcast` on `1.9.4`. I would recommend using the devel version as there are a lot of new functions – akrun May 13 '15 at 16:36
3

You can also you use the combo dplyr + tidyr

library(dplyr)
library(tidyr)

df.in %>%
  group_by(who) %>%
  mutate(when = rank(when), when = paste0("action", when)) %>%
  spread(key = when, value = what)
 ##      who action1 action2 action3 action4
 ## 1 George   sleep     run      NA      NA
 ## 2   John   sleep     run      NA      NA
 ## 3   Paul   sleep     eat     eat    walk
 ## 4  Ringo     eat    walk      NA      NA

EDIT

If you need just the first occurence of the what columns, you can just filter the data first

df.in %>%
  arrange(when) %>%
  group_by(who) %>%
  filter(!duplicated(what)) %>%
  mutate(when = rank(when), when = paste0("action", when)) %>%
  spread(key = when, value = what)
##      who action1 action2 action3
## 1 George   sleep     run      NA
## 2   John   sleep     run      NA
## 3   Paul   sleep     eat    walk
## 4  Ringo     eat    walk      NA
dickoa
  • 18,217
  • 3
  • 36
  • 50
0

I see that you have tagged plyr, but you can also do this with dplyr. Something like the below should work:

df.in %>%
    group_by(who) %>%
    arrange(when) %>%
    summarise(action1 = first(what),
              action2 = nth(what, 2),
              action3 = nth(what, 3),
              action4 = last(what))
Josh W.
  • 1,123
  • 1
  • 10
  • 17
0

Here is a method using a more traditional split-apply-combine. It's more idiomatic R code than the for loop, though {dplyr} and {data.table} solutions seem to more common than this type of {base} R solution. This method uses dcast from {reshape2} but it could also use reshape() for a purely {base} R solution.

This method is likely not much faster than the for loop given in the question. I'd be interested in knowing how the three methods given compare for a large dataset. I'm a beginner and have been working on learning R data manipulation lately. Any feedback is welcome.

library(reshape2)

#Split the data by person and apply the function
actions <- lapply(split(df.in, df.in$who), function(tmp) {

    tmp <- tmp[order(tmp$when),]
    dup <- duplicated(tmp$what)
    df.out <- data.frame(who = tmp$who[!dup], what = tmp$what[!dup])
    df.out$actionNo <- paste("action", c(1:nrow(df.out)), sep = "")
    return(df.out)

})

#Combine the results
act_rbind <- do.call(rbind, actions)
act_cast <- dcast(act_rbind, who ~ actionNo, value.var = "what")
print(act_cast)

    #      who action1 action2 action3
    # 1 George   sleep     run    <NA>
    # 2   John   sleep     run    <NA>
    # 3   Paul   sleep     eat    walk
    # 4  Ringo     eat    walk    <NA>
Kevin M
  • 481
  • 6
  • 20