1

I'm using Perl and R to analyze a large dataset of samples. For each pair of samples, I calculate a t-test p-value. Currently, I'm using the Statistics::R module to export values from Perl to R and then call the t.test function, but this process is extremely slow. Does anyone know of a Perl function that does the same thing more efficiently?

Thanks!

C8H10N4O2
HEnav
  • Potential duplicate of http://stackoverflow.com/questions/823177/how-do-i-calculate-a-p-value-if-i-have-the-t-statistic-and-d-f-in-perl – Vincent Zoonekynd Jan 22 '12 at 12:50

3 Answers

2

The volume of data, the number of dataset pairs, and perhaps even the code you have written would probably help us identify why your code is slow. For instance, sending many small datasets to R would be slow, but can probably be sped up simply by sending all the data at once.

For a pure Perl solution, you first need to compute the test statistic (that is easy, and already done in Statistics::TTest, for instance), and then convert it to a p-value (you need something like R's pt function, but I am not sure it is readily available in Perl -- you could send the T statistics to R, in one block, at the end, to convert them to p-values).
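For example, something along these lines (an untested sketch; it assumes the set/get/run interface of recent Statistics::R versions): compute all the T statistics and degrees of freedom in Perl, then convert them to p-values with a single round-trip to R's pt:

use strict;
use warnings;
use Statistics::R;

# T statistics and degrees of freedom computed in Perl for every pair of
# samples (dummy values here -- in practice fill these from your loop)
my @tstat = ( 2.1, -0.3, 4.7 );
my @df    = ( 62,   62,   62 );

my $R = Statistics::R->new();
$R->set( 'tstat', \@tstat );
$R->set( 'df',    \@df );
# one round-trip: two-sided p-values for all the tests at once
$R->run('p <- 2 * pt(-abs(tstat), df)');
my $p = $R->get('p');    # array ref of p-values, same order as @tstat
$R->stop();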

Vincent Zoonekynd
  • Hi Vincent. My data consists of 650,000 SNPs. For each SNP I create two classes of samples. These classes are sent to R, where the t.test is performed for 45,000 genes. So, I am doing the calculation 650,000 x 45,000 times... – HEnav Jan 22 '12 at 13:36
  • That is a lot, but you should profile your code, to understand where most of the time is spent: is it the loop in Perl, sending the data to R, retrieving the results, or doing the computations in R? It may be possible to send all the data to R, or to parallelize the computations, or to simplify the computations (the `t.test` function does a lot of things that you do not really need, such as parsing the arguments and computing confidence intervals -- but discarding these parts only provides a 2-fold speed improvement). – Vincent Zoonekynd Jan 22 '12 at 15:05
  • Since the same data (gene or SNP) is used many, many times, you can preprocess each SNP or gene and only keep what is needed to compute the T statistic and the p-value: the number of observations, mean, and variance for each sample (see the sketch after these comments). – Vincent Zoonekynd Jan 22 '12 at 22:45
  • I hope you correct for multiple testing. Otherwise you will be getting a huge number of false positives. – Thierry Jan 22 '12 at 23:18
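A quick sketch of that preprocessing idea (the welch_t helper below is hypothetical, not code from the question): with only the number of observations, mean and variance of each sample you can compute Welch's t statistic and its degrees of freedom, which is all the unequal-variance test needs before the p-value step:

use strict;
use warnings;

# Welch's t statistic and Welch-Satterthwaite degrees of freedom from
# precomputed summaries (n, mean, variance) of the two samples
sub welch_t {
    my ( $n1, $m1, $v1, $n2, $m2, $v2 ) = @_;
    my $se2 = $v1 / $n1 + $v2 / $n2;      # squared standard error of the difference
    my $t   = ( $m1 - $m2 ) / sqrt($se2);
    my $df  = $se2**2 /
        ( ( $v1 / $n1 )**2 / ( $n1 - 1 ) + ( $v2 / $n2 )**2 / ( $n2 - 1 ) );
    return ( $t, $df );                   # convert (t, df) to a p-value in one batch later
}

my ( $t, $df ) = welch_t( 30, 5.2, 1.1, 25, 4.8, 0.9 );
print "t = $t, df = $df\n";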
0

The Statistics::TTest module gives you a p-value.

use strict;
use warnings;
use feature 'say';
use Statistics::TTest;

# two random samples of 32 values each
my @r1 = map { rand(10)     } 1..32;
my @r2 = map { rand(10) - 2 } 1..32;

my $ttest = Statistics::TTest->new();
$ttest->load_data(\@r1, \@r2);
say "p-value = prob > |T| = ", $ttest->{t_prob};

Playing around a bit, I find that the p-values this gives you are slightly lower than what you get from R. That is most likely because R's t.test does not assume equal variances by default: it performs the Welch test, whose Welch-Satterthwaite approximation reduces the effective degrees of freedom (pass var.equal = TRUE to R to get the pooled-variance test instead). (In the above example, the difference is about 1%. If you use samples of 320 floats instead of 32, then the difference is 50% or even more, but it's a difference between 1e-12 and 1.5e-12.) If you need precise p-values, you will want to take care.

flies
0

You can also try PDL, in particular PDL::Stats.
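For example (an untested sketch; it assumes PDL::Stats::Basic's t_test, which returns the t statistic and degrees of freedom, plus Statistics::Distributions::tprob to turn those into a p-value):

use strict;
use warnings;
use feature 'say';
use PDL;
use PDL::Stats::Basic;             # t_test and friends
use Statistics::Distributions ();  # tprob(df, t): upper-tail probability

my $x = pdl( map { rand(10)     } 1 .. 32 );
my $y = pdl( map { rand(10) - 2 } 1 .. 32 );

# equal-variance two-sample t-test: t statistic and degrees of freedom
my ( $t, $df ) = $x->t_test($y);

# two-sided p-value (tprob expects an integer df, which holds here: 32 + 32 - 2)
my $p = 2 * Statistics::Distributions::tprob( $df->sclr, abs( $t->sclr ) );
say "t = $t, df = $df, p = $p";

Since PDL operations broadcast over extra dimensions, you should be able to stack many sample pairs into one piddle and get all the t statistics in a single call, which is where the real speed-up would come from.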

choroba