how to create a contingency table for each row of a data frame

Question

I have a large data frame with rows as species and counts from 2 years as columns. I want to create a contingency table for each row in order to test if there was a significant change (decrease) from the first to the second year. Here is similar pretend data:

Species   2016    2017
cat        14      8
dog        16      12
bird       10      5

and then for each row I want a table like:

cat       2017 2018
present   14    8
absent     0    6

dog       2017  2018
present   16    12
absent     0    4

bird      2017  2018
present    10    5
absent      0    5

With this I will then do a Fisher's Exact Test on each table to test if the decrease was significant or not.

I think this can be done with maybe dplyr or apply looping through rows similar to the link below but am unsure how to build the correct list of tables first. How to convert data frame to contingency table in R?

I started with one row at a time:

A <- df[1,1:3]
A[2,] <- 0
A[2,3] <- (A[1,2] - A[1,3])
fisher.test(A[2:3])

Suggestions on how to apply this to a large number of rows would be greatly appreciated! My brain really struggles with coding.

I just added the desired output for all rows. The absent value is the original (2017) count for present minus the 2018 present. So for the cat the 6 comes from 14-8. Thank you. — KNN, Mar 30 '19 at 02:52

Andrew · Answer 1 · 2019-03-30T18:40:20.773

Here is a solution using base R. You can probably use some of the ideas in this answer to make a much more concise answer. Let me know if this works for you!

# Create dataframe
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5), 
                 stringsAsFactors = F)

# Create columns to later convert to a matrix
df$absent <- 0
df$present <- df$year_2016 - df$year_2017

# Tranpose the dataframe to use lapply
df_t <- t(df)
colnames(df_t) <- as.vector(df_t[1,])
df_t <- df_t[-1,]
class(df_t) <- "numeric"

# Use lapply to create matrices
matrix_list <- lapply(1:ncol(df_t), function(x) matrix(as.vector(df_t[,x]), 2, 2, byrow = T))
names(matrix_list) <- colnames(df_t)
matrix_list
$cat
     [,1] [,2]
[1,]   14    8
[2,]    0    6

$dog
     [,1] [,2]
[1,]   16   12
[2,]    0    4

$bird
     [,1] [,2]
[1,]   10    5
[2,]    0    5

# Lots of fisher.tests
lapply(matrix_list, fisher.test)
$cat

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.01594
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.516139      Inf
sample estimates:
odds ratio 
       Inf 


$dog

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.1012
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7200866       Inf
sample estimates:
odds ratio 
       Inf 


$bird

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.03251
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.195396      Inf
sample estimates:
odds ratio 
       Inf

And then if you want the p-values you could get them in a vector using sapply:

sapply(tests, "[[", "p.value")
       cat        dog       bird 
0.01594203 0.10122358 0.03250774

EDIT: this is probably a slight improvement. It is a little more concise. I can check how it scales with microbenchmark later today fi you are concerned with performance (or you have a large number of tests to run). Also, remember to penalize those p-values with all those tests ;). Also, @tmfmnk posted a great tidyverse solution if you prefer tidyverse over base.

# Create columns to later convert to a matrix
df$absent <- 0
df$present <- df$year_2016 - df$year_2017
df_t <- t(df[-1]) # tranpose dataframe excluding column of species

# Use lapply to create the list of matrices
matrix_list <- lapply(1:ncol(df_t), function(x) matrix(as.vector(df_t[,x]), 2, 2, byrow = T))
names(matrix_list) <- df$Species

# Running the fisher's test on every matrix 
# in the list and extracting the p-values
tests <- lapply(matrix_list, fisher.test)
sapply(tests, "[[", "p.value")
       cat        dog       bird 
0.01594203 0.10122358 0.03250774

Last EDIT. Was able to run them through microbenchmark and wanted to post results for anyone who comes across this post in the future:

Unit: milliseconds

expr           min    lq     mean   median uq     max     neval
tidyverse_sol  12.506 13.497 15.130 14.560 15.827 26.205  100
base_sol       1.120  1.162  1.339  1.225  1.296  5.712   100

I did get your original post to work for me - thank you! I'm going to try the other solutions now too. — KNN, Mar 30 '19 at 14:59
Glad it worked—let me know if you run into any issues or if you have questions about any of the steps! — Andrew, Mar 30 '19 at 15:17
I like the improvements! Thank you. Then I think I can use the tests list for correcting p-values... — KNN, Mar 30 '19 at 17:24

tmfmnk · Accepted Answer · 2019-03-30T15:44:13.043

1

One tidyverse possibility could be:

library(tidyverse)
library(broom)

df %>%
 rowid_to_column() %>%
 gather(var, present, -c(Species, rowid)) %>%
 arrange(rowid, var) %>%
 group_by(rowid) %>%
 mutate(absent = lag(present, default = first(present)) - present) %>%
 ungroup() %>%
 select(-rowid, -var) %>%
 nest(present, absent) %>%
 mutate(p_value = data %>%
         map(~fisher.test(.)) %>%
         map(tidy) %>%
         map_dbl(pluck, "p.value")) %>%
 select(-data)

  Species p_value
  <chr>     <dbl>
1 cat      0.0159
2 dog      0.101 
3 bird     0.0325

Here it, first, performs a wide-to-long data transformation, excluding the columns "Species" and a column referring to row ID. Second, it arranges the data according the row ID and the original column names referring to years and groups by the row ID. Third, it calculates the differences between the years. Finally, it nests the present and absent variables per species and performs the fisher.test, then returns the p-values for each species.

edited Mar 30 '19 at 15:44

answered Mar 30 '19 at 07:32

tmfmnk

38,881
4
47
67

Don't forget to load `broom` as well. I don't think it loads with `library(tidyverse)` – Andrew Mar 30 '19 at 14:22
@Andrew it is part of `tidyverse` actually :) – tmfmnk Mar 30 '19 at 14:31
1

I am away from my computer, so I cannot check, but it doesn’t look like it loads with [tidyverse](https://tidyverse.tidyverse.org). That is, unless tidyverse automatically loads `broom` once it’s installed. I would be surprised but I’ll double check later. – Andrew Mar 30 '19 at 15:22
1

@Andrew now I see. It's part of `tidyverse` packages, however, it doesn't load with it. Thanks for noticing it! – tmfmnk Mar 30 '19 at 15:43
Thank you @tmfmnk. This worked too. I really like tidyverse for many things but I find it has a steep learning curve! – KNN Mar 30 '19 at 17:25

how to create a contingency table for each row of a data frame

2 Answers2