9

Recently, I have found that I am using the following pattern over and over again. The process is:

  • cross-tabulate the numeric variable by the factor using `table`
  • create a data frame from the resulting table
  • add the original numeric values back to the data frame (from the row names (!))
  • remove the row names
  • reorder the columns of the aggregated data frame

In R, it looks like this:

# Sample data
df <- data.frame(x = round(runif(100), 1), 
                 y = factor(ifelse(runif(100) > .5, 1, 0), 
                            labels = c('failure', 'success')) 
                )

# Get frequencies
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
# Add column of original values from rownames
dfSummary$x <- as.numeric(rownames(dfSummary))
# Remove rownames
rownames(dfSummary) <- NULL
# Reorder columns
dfSummary <- dfSummary[, c(3, 1, 2)]

Is there anything more elegant in R, preferably using base functions? I know I can use SQL to do this in a single command, so it ought to be possible to achieve similar behavior in R.

sqldf solution:

library(sqldf)
dfSummary <- sqldf("select 
                     x, 
                     sum(y = 'failure') as failure,
                     sum(y = 'success') as success
                    from df group by x")
Tomas Greif
  • I think you can just wrap everything to some function called `mytable` and run everything in single command. There is no better alternative whatsoever to `table` in base R. – David Arenburg Jan 18 '15 at 20:10
    If you want a neat solution (not with base R though), with `data.table` you can keep row names as a column when converting, something like `library(data.table) ; setDT(as.data.frame.matrix(table(df$x, df$y)), keep.rownames = TRUE)[]` – David Arenburg Jan 18 '15 at 20:57
    @DavidArenburg I would also use `dcast.data.table(setDT(df1), x~y)` – akrun Jan 19 '15 at 03:44
  • Restricting to only base R is a bit silly. – hadley Jan 19 '15 at 14:35
  • @hadley I wrote "preferably" and not "only base R required". There is a reason for this - it is extremely easy to wrap this pattern in a function. However, doing so does not make it a good solution automatically. Feel free to add another answer if you think there is better (more reliable/faster) solution using something else. – Tomas Greif Jan 19 '15 at 20:33
    @TomasGreif If you are looking for faster solution, `dcast.data.table` would be faster as well as compact. – akrun Jan 20 '15 at 18:15

3 Answers

8

An alternative with base R could be:

aggregate(. ~ x, transform(df, success = y == "success", 
                               failure = y == "failure", y = NULL), sum)
#     x success failure
#1  0.0       2       4
#2  0.1       6       8
#3  0.2       1       7
#4  0.3       5       4
#5  0.4       6       6
#6  0.5       3       3
#7  0.6       4       6
#8  0.7       6       6
#9  0.8       4       5
#10 0.9       6       7
#11 1.0       1       0
talat
6

Your code wrapped in a function would be efficient compared to the other base R solutions (so far). If you want the code in a single line, a "reshape/table" combo from base R could be used.

reshape(as.data.frame(table(df)), idvar='x', timevar='y',
        direction='wide')
#     x Freq.failure Freq.success
#1    0            3            2
#2  0.1            3            9
#3  0.2            5            5
#4  0.3            8            7
#5  0.4            5            3
#6  0.5            9            4
#7  0.6            3            6
#8  0.7            7            6
#9  0.8            3            1
#10 0.9            4            3
#11   1            0            4

In case you want to try `data.table`:

library(data.table)
dcast.data.table(setDT(df), x~y)
#      x failure success
# 1: 0.0       3       2
# 2: 0.1       3       9
# 3: 0.2       5       5
# 4: 0.3       8       7
# 5: 0.4       5       3
# 6: 0.5       9       4
# 7: 0.6       3       6
# 8: 0.7       7       6
# 9: 0.8       3       1
#10: 0.9       4       3
#11: 1.0       0       4

Update

I didn't notice that `as.data.frame(table(...))` converts the columns to "factor" (thanks to @hadley's comment). A workaround is:

res <- transform(reshape(as.data.frame(table(df), stringsAsFactors=FALSE),
     idvar='x', timevar='y', direction='wide'), x= as.numeric(x))

data

set.seed(24)
df <- data.frame(x = round(runif(100), 1), 
             y = factor(ifelse(runif(100) > .5, 1, 0), 
                        labels = c('failure', 'success')) 
            )

Benchmarks

set.seed(24)
df <- data.frame(x = round(runif(1e6), 1), 
             y = factor(ifelse(runif(1e6) > .5, 1, 0), 
                        labels = c('failure', 'success')) 
            )

tomas <- function(){
   dfSummary <- as.data.frame.matrix(table(df$x, df$y))
   dfSummary$x <- as.numeric(rownames(dfSummary))
   dfSummary <- dfSummary[, c(3, 1, 2)]}

doc <- function(){aggregate(. ~ x, transform(df,
       success = y == "success", failure = y == "failure",
       y = NULL), sum)}

akrun <- function(){reshape(as.data.frame(table(df)),
       idvar='x', timevar='y', direction='wide')}

library(microbenchmark)
microbenchmark(tomas(), doc(), akrun(), unit='relative', times=20L)
#Unit: relative
 #expr       min         lq      mean    median         uq       max neval cld
 #tomas()  1.000000  1.0000000  1.000000  1.000000  1.0000000  1.000000    20  a 
 #doc()   13.451037 11.5050997 13.082074 13.043584 12.8048306 19.715535    20   b
 #akrun()  1.019977  0.9522809  1.012332  1.007569  0.9993835  1.533191    20  a 

Updated with dcast.data.table

df1 <- copy(df)
akrun2 <- function() {dcast.data.table(setDT(df1), x~y)}
microbenchmark(tomas(), akrun2(), unit='relative', times=20L)
#   Unit: relative
# expr      min       lq     mean  median       uq      max neval cld
# tomas() 6.493231 6.345752 6.410853 6.51594 6.502044 5.591753    20   b
# akrun2() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000    20  a 
akrun
  • Using `table()` + `as.data.frame()` turns `x` into a factor – hadley Jan 19 '15 at 14:15
  • @hadley But my other method won't change it to factor `dcast.data.table(..)` I didn't notice it though, thanks. – akrun Jan 19 '15 at 14:19
  • The workaround still converts it to character and then converts it back, which is obviously a pain - I don't think there is a way to do this correctly with a simple combination of base functions. – hadley Jan 19 '15 at 14:28
  • @hadley But the OP prefers only base functions :-) – akrun Jan 19 '15 at 14:29
  • Right, I get that - there are lots of things you can't do easily and correctly with base R. – hadley Jan 19 '15 at 14:34
3

This should be relatively efficient. You cannot really suppress row names in a data frame, since they are a requirement of a valid data frame.

X <- table(df$x,df$y)
cbind( data.frame(x=rownames(X)), unclass(X) )
      x failure success
0     0       5       3
0.1 0.1       6       1
0.2 0.2       7       8
0.3 0.3       7       3
0.4 0.4       6       6
0.5 0.5       6       4
0.6 0.6       2       5
0.7 0.7       2       7
0.8 0.8       3       7
0.9 0.9       4       6
1     1       2       0
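If you want a result matching the question's layout (numeric `x`, default row names), you can convert the row names back and reset them afterwards. A small follow-up sketch building on the answer above, not part of the original answer:

```r
# Sample data, as in the question
set.seed(24)
df <- data.frame(x = round(runif(100), 1),
                 y = factor(ifelse(runif(100) > .5, 1, 0),
                            labels = c('failure', 'success')))

# Cross-tabulate as in the answer above
X <- table(df$x, df$y)

# Bind the row names back as a numeric column, then drop the row names
dfSummary <- cbind(data.frame(x = as.numeric(rownames(X))), unclass(X))
rownames(dfSummary) <- NULL

str(dfSummary)  # columns x (numeric), failure, success
```

Resetting `rownames(...) <- NULL` simply restores the default sequential row names, which is as close to "no row names" as a data frame allows.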
IRTFM