Categorical summary statistics for two datasets in one table (comparison)

Question

I'm having issues with producing a table that summarizes two of my datasets and their categorical variables, a table with a format I see quite often in various papers.

The issue is as follows, I have two datasets (one filtered a bit more than the other) and I'd like to show their categorical summary statistics side by side. Using two datasets:

A <- head(mtcars[, c(2, 8:11)])

                  cyl vs am gear carb
Mazda RX4           6  0  1    4    4
Mazda RX4 Wag       6  0  1    4    4
Datsun 710          4  1  1    4    1
Hornet 4 Drive      6  1  0    3    1
Hornet Sportabout   8  0  0    3    2
Valiant             6  1  0    3    1

B <- head(mtcars[3:6, c(2, 8:11)])
                  cyl vs am gear carb
Datsun 710          4  1  1    4    1
Hornet 4 Drive      6  1  0    3    1
Hornet Sportabout   8  0  0    3    2
Valiant             6  1  0    3    1

I'd like to provide summary statistics like so:

                           Table A                  Table B
  Variable Levels Count Column N %  Levels Count Column N %
1       cyl      4     1      16.67      4     1         25
2                6     4      66.67      6     2         50
3                8     1      16.67      8     1         25
4        vs      0     3         50      0     1         25
5                1     3         50      1     3         75
6        am      0     3         50      0     3         75
7                1     3         50      1     1         25
8      gear      3     3         50      3     3         75
9                4     3         50      4     1         25
10     carb      1     3         50      1     3         75
11               2     1      16.67      2     1         25
12               4     2      33.33      4     0          0

I have been able to use the prettyTable.R described in this post on the cars dataset to approach what I want, but I'm having difficulties adjusting this snippet to my needs:

Roman Luštrik:
Here's my solution. It ain't pretty, which is why I put a bag over its head (wrap it in     a function). I also add another variable to demonstrate that it's general (I hope).

prettyTable <- function(x) {

  tbl <- apply(x, 2, function(m) {
    marc <- sort(unique(m))
    cnt <- matrix(table(m), ncol = 1)
    out <- cbind(marc, cnt)
    out <- out[order(marc), ] # do sorting
    out <- cbind(out, round(prop.table(out, 2)[, 2] * 100, 2))
  })

  x2 <- do.call("rbind", tbl)

  spaces <- unlist(lapply(apply(x, 2, unique), length))
  space.names <- names(spaces)
  spc <- rep("", sum(spaces))
  ind <- cumsum(spaces)
  ind <- abs(spaces - ind)+1
  spc[ind] <- space.names

  out <- cbind(spc, x2)
  out <- as.data.frame(out)

  names(out) <- c("Variable", "Levels", "Count", "Column N %")
  out
}

I have been able to do so (partially) by cbinding the output of this prettyTable:

cbind(prettyTable(A)[1:11,],prettyTable(B))

A few issues in this approach: notice the 1:11 part in the first prettyTable: this code is unable to identify that varying number of levels appear in the two datasets. I'm unfortunately not skilled enough to identify what code needs to be added / adjusted in order to do so to obtain the required result without manual edits.

Furthermore, the prettyTable.R snippet does not accept my categorical variables if they are factors, providing me with an error that (I think) refers to the prop.table function in the code. To replicate the situation, add the following code prior to making the prettyTables.

A$cyl <- as.factor(A$cyl)
B$cyl <- as.factor(B$cyl)
prettyTable(A)
Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument

Lastly, the function does not accept one column of data when summarizing. This is not necessarily applicable to my situation, but I'm guessing such a snippet could be useful for other people if it had this functionality as well.

prettyTable(A$cyl)
Error in apply(x, 2, function(m) { : dim(X) must have a positive length

Thank you very much for any help, I keep scratching my head while trying to figure this one out, but I have been unable to do so on my own.

G. Grothendieck · Accepted Answer · 2017-04-12T21:15:42.077

Here are some approaches:

Hmisc See ?summary.formula for more info.

library(Hmisc)
AB <- rbind(cbind(A, Table = "A"), cbind(B, Table = "B"))
s <- summary(Table ~., data = AB, method = "reverse")
print(s, exclude1 = FALSE)

giving:

Descriptive Statistics by Table

+--------+-------+-------+
|        |A      |B      |
|        |(N=6)  |(N=4)  |
+--------+-------+-------+
|cyl : 4 |17% (1)|25% (1)|
+--------+-------+-------+
|    6   |67% (4)|50% (2)|
+--------+-------+-------+
|    8   |17% (1)|25% (1)|
+--------+-------+-------+
|vs : 0  |50% (3)|25% (1)|
+--------+-------+-------+
|    1   |50% (3)|75% (3)|
+--------+-------+-------+
|am : 0  |50% (3)|75% (3)|
+--------+-------+-------+
|    1   |50% (3)|25% (1)|
+--------+-------+-------+
|gear : 3|50% (3)|75% (3)|
+--------+-------+-------+
|    4   |50% (3)|25% (1)|
+--------+-------+-------+
|carb : 1|50% (3)|75% (3)|
+--------+-------+-------+
|    2   |17% (1)|25% (1)|
+--------+-------+-------+
|    4   |33% (2)| 0% (0)|
+--------+-------+-------+

tableone

library(tableone)
AB.fac <- replace(AB, TRUE, lapply(AB, factor)) # AB is from above
tableOne <- CreateCatTable(vars = names(AB)[-6], strata = "Table", data = AB.fac)
print(tableOne, showAllLevels = TRUE, test = FALSE)

giving:

          Stratified by Table
           level A         B        
  n              6         4        
  cyl (%)  4     1 (16.7)  1 (25.0) 
           6     4 (66.7)  2 (50.0) 
           8     1 (16.7)  1 (25.0) 
  vs (%)   0     3 (50.0)  1 (25.0) 
           1     3 (50.0)  3 (75.0) 
  am (%)   0     3 (50.0)  3 (75.0) 
           1     3 (50.0)  1 (25.0) 
  gear (%) 3     3 (50.0)  3 (75.0) 
           4     3 (50.0)  1 (25.0) 
  carb (%) 1     3 (50.0)  3 (75.0) 
           2     1 (16.7)  1 (25.0) 
           4     2 (33.3)  0 ( 0.0)

REVISED Added tableone solution.

Do you have any idea on a solution in which the percentages & numbers are in separate columns? — MattV, Apr 06 '14 at 13:19
The data can be extracted from the `tableOne` object: `cbind(do.call(rbind, tableOne$A)[4:6], do.call(rbind, tableOne$B)[4:6])` — G. Grothendieck, Apr 06 '14 at 13:31

Categorical summary statistics for two datasets in one table (comparison)

1 Answers1