
I apologize if this is a very naive question...

I have 7000 2x4 contingency tables of count data. Each table represents a particular position in a genome and records the number of times each DNA nucleotide is observed at that position in 2 different environments. An example contingency table would be

            A      C      G      T
condition1  0      2      20     70000
condition2  3      15     0      95000

or
            A      C      G      T
condition1  80146  0      5      0
condition2  26821  2      4      0

The counts are non-negative integers: the minimum is 0 and the maximum can reach ~800,000. One cell generally holds nearly all of the counts in its row, and it is the same column in both conditions (cell T in the first example above, cell A in the second). One or two other cells will have low counts, and it is in these cells that the difference, if any, should be observed.

The goal is to identify the positions that differ significantly between these 2 environmental conditions, for further analysis. Our measurement method is estimated to have an error rate of 10^-6.

I am using R to analyze this data. I am not sure I can run a chi-square test on it, because some cells have small or 0 counts. With Fisher's exact test (fisher.test) I get 2 errors:

with a workspace of 1E5 
FEXACT error 40.
Out of workspace.

with a workspace of >3E5
FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.

Can anyone suggest an appropriate test, or suitable settings for the Fisher or chi-square test?

Many thanks in advance,

Ron

  • To make the question clearer, you can give names to columns, tell what values can come in each column and give 2 example tables. This will help members of the forum to help you. – rnso Sep 14 '14 at 07:15
  • Just did. Hope this makes more sense now. – Ron Sep 14 '14 at 07:35
  • "1 or 2 other cells will have low counts... it is in these other cells where the difference, if any, should be observed.": what do you mean by low: will taking 100 as cutoff be OK? – rnso Sep 14 '14 at 08:14
  • 100 may be high... maybe 30 or 50 is a better cutoff. In theory the error rate is 1e-6, and the median total per row is around 50,000-150,000 events (n), so observing even a few events should be above the error. – Ron Sep 14 '14 at 08:41

2 Answers


Fisher's exact test in R (fisher.test) only works on smaller data. If you reduce the counts in column T from 70000 and 95000 to 700 and 950, the Fisher test will run.

Meanwhile, I tried chisq.test on your data and it worked. For larger data, the chi-square test is preferred over Fisher's exact test.
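If you would rather not scale the counts down, one option that may be worth trying is the Monte Carlo mode of fisher.test, which simulates tables with the same margins and so does not use the FEXACT workspace. This is only a sketch on the first example table from the question. The names tab and mc_fisher, the seed, and the choice of B are my own assumptions, and I have not checked how well it scales to all 7000 tables.

```r
# First example table from the question (rows: conditions, columns: nucleotides).
tab <- matrix(c(0,  2, 20, 70000,
                3, 15,  0, 95000),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("condition1", "condition2"),
                              c("A", "C", "G", "T")))

set.seed(42)  # arbitrary seed, only so the simulated p-value is reproducible

# Monte Carlo Fisher test: B random tables with the observed row and column
# totals are generated instead of enumerating all tables exactly.
mc_fisher <- fisher.test(tab, simulate.p.value = TRUE, B = 2000)

mc_fisher$p.value
```

Note that a simulated p-value cannot be smaller than 1/(B + 1), so B has to be large if very small p-values matter, for example after correcting for 7000 tests.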

ArnonZ
  • Hi, I am not sure whether I would run into problems if I scale the data down by a factor of 10, since it's not possible to scale the 0 values, and getting a 0 when n is 100,000 shouldn't mean the same as when it's 10,000. As for chisq.test, as below, I am not sure if it's OK to use it with cells that contain less than 5, and I get a warning message that the approximation may be incorrect. – Ron Sep 14 '14 at 09:53
  • Hi, Ron, I am not an expert in statistics. You may check this page out http://www.langsrud.com/fisher.htm. I think as long as chisq.test works on your data, you should not worry too much. – Kai Sun Sep 14 '14 at 10:19

Chi-square test works:

df1 = structure(list(A = c(0L, 3L), C = c(2L, 15L), G = c(20L, 0L), 
    T = c(70000L, 95000L)), .Names = c("A", "C", "G", "T"), class = "data.frame", row.names = 1:2)

df1
  A  C  G     T
1 0  2 20 70000
2 3 15  0 95000

chisq.test(df1)

        Pearson's Chi-squared test

data:  df1
X-squared = 35.8943, df = 3, p-value = 7.884e-08

Warning message:
In chisq.test(df1) : Chi-squared approximation may be incorrect

I am not sure if this is sufficient.
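To see where the warning comes from, you can look at the expected counts: as far as I know, chisq.test warns whenever any expected count is below 5. The sketch below also shows the simulate.p.value option of chisq.test, which replaces the chi-squared approximation with a Monte Carlo p-value. The names tab, expected, low_cells and mc, and the choice of B, are just for illustration, and I can't say whether this is the statistically right approach for your data.

```r
# Same example table as above, built as a matrix.
tab <- matrix(c(0,  2, 20, 70000,
                3, 15,  0, 95000),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("condition1", "condition2"),
                              c("A", "C", "G", "T")))

# Expected counts under independence; the warning is suppressed here
# because we only want to inspect the expected values.
expected <- suppressWarnings(chisq.test(tab))$expected
low_cells <- expected < 5   # TRUE for cells whose expected count is below 5
low_cells

set.seed(1)  # arbitrary seed, only so the simulated p-value is reproducible

# Same test statistic, but the p-value is estimated from B simulated tables
# with the observed margins instead of the chi-squared distribution.
mc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
mc$p.value
```

Here the two cells in column A have expected counts below 5, which is what triggers the warning.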

rnso
  • Is it OK to run a chi-square test when the values of some cells are 0 or below 5? Is this where the warning is coming from? – Ron Sep 14 '14 at 08:51
  • I would agree with KaiSun and ignore this warning. fisher.test gives an error, which is all the more reason that you should use chi-square test. For statistical advice, you should post at http://stats.stackexchange.com/ (CrossValidated). – rnso Sep 14 '14 at 11:18
  • Thanks everyone. I posted it on the stats exchange to make sure it's the right test and that it's OK to ignore the warning. – Ron Sep 14 '14 at 12:37