4

Hi,I saw someone calculate venn diagram overlap p-value as in the following example. They use hypergeometric distribution and R. When I apply their function in R, I just cannot get the same results. Can anyone help me about this?

The sample I saw in someone else's publication:

From 15220 genes, set A is 1850+195 genes, set B is 195+596 genes, overlap is 195 genes. Their p value is 2e-26.

Their method is: given a total of N genes, if gene sets A and B contain m and n genes, respectively, and k of them are in common, then the p-value of enrichment is calculated by:

p = Σ (m,i)(N-m,n-i)/(N,n)

for i from k to min(m,n), where "(m,i)" represent a binomial form.

And the way I'm using R is:

sum(choose(596+195,195:(195+596))*choose(15220-596-195,(1850+195)-195:(195+596)))/choose(15220,1850+195).

I got NaN.

Or using: phyper(195,1850+195,15220-1850-195,596+195), I got 1.

I also refer from link http://www.pangloss.com/wiki/VennSignificance but when I caculate

1 - phyper(448,1000,13800,2872) in R, I got 0 instead 1.906314e-81 of the link.

I am completely new to R and statistic, sorry for post many mistakes here.

Ferdinand.kraft
  • 12,579
  • 10
  • 47
  • 69
user2700418
  • 43
  • 1
  • 3
  • Do you have a link for the publication? Are you sure that the p-value reported in the paper is for the overlap, or something else? – nograpes Aug 20 '13 at 16:46
  • Yes, the paper link is (on page 3)http://www.pnas.org/content/suppl/2007/10/02/0701014104.DC1/01014SuppText.pdf. Thank you. Also, paper link is http://www.pnas.org/content/104/42/16438.abstract – user2700418 Aug 20 '13 at 18:31

1 Answers1

5

Using package gmp, and replacing choose with chooseZ, we can implement you p-value as:

require(gmp)

enrich_pvalue <- function(N, A, B, k)
{
    m <- A + k
    n <- B + k
    i <- k:min(m,n)

    as.numeric( sum(chooseZ(m,i)*chooseZ(N-m,n-i))/chooseZ(N,n) )
}

Result:

> enrich_pvalue(15220, 1850, 596, 195)
[1] 1.91221e-18

Using the example in your pangloss link (with your notation), we get:

> enrich_pvalue(N=14800, A=1000-448, B=2872-448, k=448)
[1] 7.289388e-81
Ferdinand.kraft
  • 12,579
  • 10
  • 47
  • 69
  • Thank you, Ferdinand, for your nice answer, it solves my problem! As a further question, would give me some idea when using other command other then 'choose', like 'phyper'? Is there any universal settings for R to produce more precise numeric results? I really appreciate your time. – user2700418 Aug 20 '13 at 20:34