9

Recently, I have found that I am using the following pattern over and over again. The process is:

  • cross-tabulate the numeric variable by the factor using `table`
  • create a data frame from the resulting table
  • add the original numeric values back to the data frame (from the row names (!))
  • remove the row names
  • reorder the columns of the aggregated data frame

In R, it looks like this:

# Sample data
df <- data.frame(x = round(runif(100), 1), 
                 y = factor(ifelse(runif(100) > .5, 1, 0), 
                            labels = c('failure', 'success')) 
                )

# Get frequencies
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
# Add column of original values from rownames
dfSummary$x <- as.numeric(rownames(dfSummary))
# Remove rownames
rownames(dfSummary) <- NULL
# Reorder columns
dfSummary <- dfSummary[, c(3, 1, 2)]

Is there anything more elegant in R, preferably using base functions? I know I can use SQL to do this in a single command, so it ought to be possible to achieve similar behavior in R.

sqldf solution:

library(sqldf)
dfSummary <- sqldf("select 
                     x, 
                     sum(y = 'failure') as failure,
                     sum(y = 'success') as success
                    from df group by x")
Tomas Greif
  • I think you can just wrap everything to some function called `mytable` and run everything in single command. There is no better alternative whatsoever to `table` in base R. – David Arenburg Jan 18 '15 at 20:10
    If you want a neat solution (not with base R though), with `data.table` you can keep row names as a column when converting, something like `library(data.table) ; setDT(as.data.frame.matrix(table(df$x, df$y)), keep.rownames = TRUE)[]` – David Arenburg Jan 18 '15 at 20:57
    @DavidArenburg I would also use `dcast.data.table(setDT(df1), x~y)` – akrun Jan 19 '15 at 03:44
  • Restricting to only base R is a bit silly. – hadley Jan 19 '15 at 14:35
  • @hadley I wrote "preferably" and not "only base R required". There is a reason for this - it is extremely easy to wrap this pattern in a function. However, doing so does not make it a good solution automatically. Feel free to add another answer if you think there is better (more reliable/faster) solution using something else. – Tomas Greif Jan 19 '15 at 20:33
    @TomasGreif If you are looking for faster solution, `dcast.data.table` would be faster as well as compact. – akrun Jan 20 '15 at 18:15

3 Answers

8

An alternative with base R could be:

aggregate(. ~ x, transform(df, success = y == "success", 
                               failure = y == "failure", y = NULL), sum)
#     x success failure
#1  0.0       2       4
#2  0.1       6       8
#3  0.2       1       7
#4  0.3       5       4
#5  0.4       6       6
#6  0.5       3       3
#7  0.6       4       6
#8  0.7       6       6
#9  0.8       4       5
#10 0.9       6       7
#11 1.0       1       0
talat
6

Your code wrapped in a function would be efficient compared to the other base R solutions (so far). If you want the code in a single line, a "reshape/table" combo from base R could be used.

reshape(as.data.frame(table(df)), idvar='x', timevar='y',
        direction='wide')
#     x Freq.failure Freq.success
#1    0            3            2
#2  0.1            3            9
#3  0.2            5            5
#4  0.3            8            7
#5  0.4            5            3
#6  0.5            9            4
#7  0.6            3            6
#8  0.7            7            6
#9  0.8            3            1
#10 0.9            4            3
#11   1            0            4

In case you want to try `data.table`:

library(data.table)
dcast.data.table(setDT(df), x~y)
#      x failure success
# 1: 0.0       3       2
# 2: 0.1       3       9
# 3: 0.2       5       5
# 4: 0.3       8       7
# 5: 0.4       5       3
# 6: 0.5       9       4
# 7: 0.6       3       6
# 8: 0.7       7       6
# 9: 0.8       3       1
#10: 0.9       4       3
#11: 1.0       0       4

Update

I didn't notice that `as.data.frame(table(...))` converts the columns to "factor" (thanks to @hadley's comment). A workaround is:

res <- transform(reshape(as.data.frame(table(df), stringsAsFactors=FALSE),
     idvar='x', timevar='y', direction='wide'), x= as.numeric(x))

data

set.seed(24)
df <- data.frame(x = round(runif(100), 1), 
             y = factor(ifelse(runif(100) > .5, 1, 0), 
                        labels = c('failure', 'success')) 
            )

Benchmarks

set.seed(24)
df <- data.frame(x = round(runif(1e6), 1), 
             y = factor(ifelse(runif(1e6) > .5, 1, 0), 
                        labels = c('failure', 'success')) 
            )

tomas <- function(){
   dfSummary <- as.data.frame.matrix(table(df$x, df$y))
   dfSummary$x <- as.numeric(rownames(dfSummary))
   dfSummary <- dfSummary[, c(3, 1, 2)]}

doc <- function(){aggregate(. ~ x, transform(df,
       success = y == "success", failure = y == "failure",
       y = NULL), sum)}

akrun <- function(){reshape(as.data.frame(table(df)),
       idvar='x', timevar='y', direction='wide')}

library(microbenchmark)
microbenchmark(tomas(), doc(), akrun(), unit='relative', times=20L)
#Unit: relative
 #expr       min         lq      mean    median         uq       max neval cld
 #tomas()  1.000000  1.0000000  1.000000  1.000000  1.0000000  1.000000    20  a 
 #doc()   13.451037 11.5050997 13.082074 13.043584 12.8048306 19.715535    20   b
 #akrun()  1.019977  0.9522809  1.012332  1.007569  0.9993835  1.533191    20  a 

Updated with dcast.data.table

df1 <- copy(df)
akrun2 <- function() {dcast.data.table(setDT(df1), x~y)}
microbenchmark(tomas(), akrun2(), unit='relative', times=20L)
#   Unit: relative
# expr      min       lq     mean  median       uq      max neval cld
# tomas() 6.493231 6.345752 6.410853 6.51594 6.502044 5.591753    20   b
# akrun2() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000    20  a 
akrun
  • Using `table()` + `as.data.frame()` turns `x` into a factor – hadley Jan 19 '15 at 14:15
  • @hadley But my other method won't change it to factor `dcast.data.table(..)` I didn't notice it though, thanks. – akrun Jan 19 '15 at 14:19
  • The workaround still converts it to character and then converts it back, which is obviously a pain - I don't think there is a way to do this correctly with a simple combination of base functions. – hadley Jan 19 '15 at 14:28
  • @hadley But the OP prefers only base functions :-) – akrun Jan 19 '15 at 14:29
  • Right, I get that - there are lots of things you can't do easily and correctly with base R. – hadley Jan 19 '15 at 14:34
3

This should be relatively efficient. You cannot really suppress row names in a data frame, since they are a requirement of a valid data frame.

X <- table(df$x,df$y)
cbind( data.frame(x=rownames(X)), unclass(X) )
      x failure success
0     0       5       3
0.1 0.1       6       1
0.2 0.2       7       8
0.3 0.3       7       3
0.4 0.4       6       6
0.5 0.5       6       4
0.6 0.6       2       5
0.7 0.7       2       7
0.8 0.8       3       7
0.9 0.9       4       6
1     1       2       0
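If you want a result matching the question's layout (numeric `x`, default row names), you can convert the row names back and reset them afterwards. A small follow-up sketch building on the answer above, not part of the original answer:

```r
# Sample data, as in the question
set.seed(24)
df <- data.frame(x = round(runif(100), 1),
                 y = factor(ifelse(runif(100) > .5, 1, 0),
                            labels = c('failure', 'success')))

# Cross-tabulate as in the answer above
X <- table(df$x, df$y)

# Bind the row names back as a numeric column, then drop the row names
dfSummary <- cbind(data.frame(x = as.numeric(rownames(X))), unclass(X))
rownames(dfSummary) <- NULL

str(dfSummary)  # columns x (numeric), failure, success
```

Resetting `rownames(...) <- NULL` simply restores the default sequential row names, which is as close to "no row names" as a data frame allows.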
IRTFM