Cross tabs - need to aggregate several columns based on certain values in other columns

Question

My data looks like this, all columns with binary presence/absence data:

POP1   POP2   POP3    T1    T2    T3    T4    T5    T6    T7    T8    T9
 1      1      0       1     1     1     1     0     1     0     0     1
 1      0      1       0     1     1     0     1     1     0     1     1
 1      1      0       1     1     1     1     0     0     1     0     1
 0      0      0       0     1     1     0     1     0     1     1     0
 1      0      1       0     0     1     1     1     0     1     1     0
 0      1      0       0     1     1     1     0     0     0     0     1
 0      1      0       1     1     0     1     0     0     0     0     0
 1      1      1       0     1     0     0     0     1     0     0     0
 0      0      0       0     1     1     1     1     1     0     0     1
 1      0      0       1     0     1     0     1     0     1     1     1
 1      1      0       0     1     0     1     0     0     1     0     0 
 1      0      1       0     1     1     1     0     1     0     1     0
 0      1      0       1     1     1     1     0     0     0     0     0
 1      0      0       0     1     1     0     0     0     0     1     1

The POP1:POP3 are populations, and I need counts of all 1's for all T1:T9 for all POP1=1, POP2=1 and POP3=1. I need a table that crosstabulates my data like this:

         T1    T2    T3    T4    T5    T6    T7    T8    T9
POP1=1    3     9     7     5     3     4     4     5     5
POP2=1    4     7     8     6     2     3     2     0     3
POP3=1    0     3     4     2     2     2     1     3     1

Don't bother checking the aggregated counts, they're not necessarily correct. I've have tried lots of synthaxes without getting what I want. Thankful for some guidance.

score 4 · Accepted Answer · answered Feb 21 '17 at 22:29

4

You need the matrix multiplication %*% here:

t(df[1:3]) %*% as.matrix(df[4:12]) 

     T1 T2 T3 T4 T5 T6 T7 T8 T9
POP1  3  7  7  5  3  4  4  5  5
POP2  4  7  4  6  0  2  2  0  3
POP3  0  3  3  2  2  3  1  3  1

answered Feb 21 '17 at 22:29

Psidom

209,562
33
339
356

When using this synthax on my real data, where there are a lot more columns I get this error: Error in t(sysrev60[85:103]) %*% as.matrix(sysrev60[40:68]) : requires numeric/complex matrix/vector arguments. Too many columns? – Dag Feb 21 '17 at 22:45
Maybe some of your columns are not numeric, you can check the column classes with `lapply(sysrev60, class)`. – Psidom Feb 21 '17 at 22:48
Just checked. They're all numeric. – Dag Feb 21 '17 at 22:51
Hmm. What is the size of your data? `all(sapply(sysrev60, is.numeric))` – Psidom Feb 21 '17 at 22:52
I now made a new df with only the columns I need, all numeric, as the entire data file has many excess columns. This new column has 19 POP columns and 29 T columns. When running the script t(df.cross[1:19]) %*% as.matrix(df.cross[20:48]) I get this error: Error in t(df.cross[1:19]) %*% as.matrix(df.cross[20:48]) : non-conformable arguments – Dag Feb 21 '17 at 22:58
The error suggests the *ncol* of the matrix at the left operand is not the same as the *nrow* of the matrix at the right operand, but this doesn't make much sense as `ncol(t(df.cross[1:19])) == nrow(df.cross[20:48])` must be true. – Psidom Feb 21 '17 at 23:05
ncol at the left is 48 and nrow on the right is 29. – Dag Feb 21 '17 at 23:09
How is that possible? both of them should be equal to `nrow(df.cross)`. Did you apply the `t()` on the right matrix also? But the formula looks correct to me, if you have copied and pasted the error message. – Psidom Feb 21 '17 at 23:12
This is the synthax I used: t(df.cross[1:19]) %*% as.matrix(df.cross[20:48]). I don't understand why nrow (left) and ncol (right) should be the same when they're counting different dimensions of the matrix. – Dag Feb 21 '17 at 23:17
Yeah, the syntax looks correct to me. `ncol(left) == nrow(right)` that's how matrix is multiplied, taking one row from left and one column from right and calculate the inner product. But I can't really see why it's not working. It seems your data size is not larger either. so it should not be the memory problem. – Psidom Feb 21 '17 at 23:21
I have a problem wrapping my head around this. Could it have something to do with me not needing the POPx=0, but only the POPx=1? Guess not... – Dag Feb 22 '17 at 05:59
Congrats! Good to hear that. – Psidom Feb 26 '17 at 19:06

score 2 · Answer 2 · answered Feb 21 '17 at 22:30

df = structure(list(POP1 = c(1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L), POP2 = c(1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 
0L, 1L, 0L, 1L, 0L), POP3 = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 0L, 1L, 0L, 0L), T1 = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L), T2 = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 
1L, 0L, 1L, 1L, 1L, 1L), T3 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 
1L, 1L, 0L, 1L, 1L, 1L), T4 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 
1L, 0L, 1L, 1L, 1L, 0L), T5 = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 0L), T6 = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 
1L, 0L, 0L, 1L, 0L, 0L), T7 = c(0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 
0L, 1L, 1L, 0L, 0L, 0L), T8 = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 
0L, 1L, 0L, 1L, 0L, 1L), T9 = c(1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 1L)), .Names = c("POP1", "POP2", "POP3", 
"T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9"), class = "data.frame", 
row.names = c(NA, -14L))

library(reshape2)
df = melt(df, id.vars = colnames(df)[-(1:3)] )

do.call(rbind, lapply(split(df, df$variable), function(x)
                    apply(x[x$value == 1,1:9], 2, function(y) sum(y))))

#     T1 T2 T3 T4 T5 T6 T7 T8 T9
#POP1  3  7  7  5  3  4  4  5  5
#POP2  4  7  4  6  0  2  2  0  3
#POP3  0  3  3  2  2  3  1  3  1

I've tried this and got the error message: Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : group length is 0 but data length > 0 — Dag, Feb 22 '17 at 17:44

Cross tabs - need to aggregate several columns based on certain values in other columns

2 Answers2