
I'm looking to generate some statistics about a model I created in Python. I'd like to run a t-test on it, but was wondering if there is an easy way to do this with numpy/scipy. Are there any good explanations around?

For example, I have three related datasets that look like this:

[55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0]

Now, I would like to run Student's t-test on them.

Mark

3 Answers


The scipy.stats package has a few ttest_... functions. For example:

>>> # x is the sample array, m is the hypothesized population mean
>>> print('t-statistic = %6.3f pvalue = %6.4f' % stats.ttest_1samp(x, m))
t-statistic =  0.391 pvalue = 0.6955
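
A self-contained sketch with the data from the question might look like the following; note that the hypothesized population mean of 50 is only an illustrative assumption, since the original post does not state one:

import numpy as np
from scipy import stats

# the sample from the question
x = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
m = 50.0  # hypothesized population mean (illustrative assumption)

# one-sample t-test: is the sample mean consistent with m?
t_stat, p_value = stats.ttest_1samp(x, m)
print("t-statistic = %6.3f pvalue = %6.4f" % (t_stat, p_value))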
van
  • Thanks for responding. It seems to take a random variable. Do I have to generate a random variable out of my sample population beforehand? – Mark Feb 24 '10 at 15:42
  • I think you can just use your sample (not "sample population") – van Feb 24 '10 at 15:54
  • Sample as in one sample value? I was under the impression that I could use a sample of several results as a parameter, but maybe I was misled :) – Mark Feb 24 '10 at 16:33
  • In statistics, a sample is a subset of a population (see http://en.wikipedia.org/wiki/Sample_%28statistics%29). So what I meant is simply that there is no term "sample population" :) One value is just one value from the sample (which is a set of values). – van Feb 24 '10 at 17:40

van's answer using scipy is exactly right, and the scipy.stats.ttest_* functions are very convenient.

But I came to this page looking for a solution with pure numpy, as stated in the heading, to avoid the scipy dependency. To this end, let me point out the example given here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.standard_t.html

The main problem is that numpy does not have cumulative distribution functions, so my conclusion is that you should really use scipy. Still, using only numpy is possible:

From the original question I am guessing that you want to compare your datasets and judge with a t-test whether there is a significant deviation, and further that the samples are paired (see https://en.wikipedia.org/wiki/Student%27s_t-test#Unpaired_and_paired_two-sample_t-tests). In that case, you can calculate the t- and p-value like so:

import numpy as np

sample1 = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
sample2 = np.array([54.0, 56.0, 48.0, 46.0, 56.0, 56.0, 55.0, 62.0])
# paired samples -> under the null hypothesis the differences have mean 0
difference = sample1 - sample2
# the t-value is easily computed with numpy
t = np.mean(difference) / (difference.std(ddof=1) / np.sqrt(len(difference)))
# unfortunately, numpy does not have a built-in CDF, so here is a
# ridiculous work-around: estimate the CDF by sampling from the
# t-distribution with len(difference) - 1 degrees of freedom
s = np.random.standard_t(len(difference) - 1, size=100000)
p = np.mean(s < t)
# two-sided test
print("The two-sided p-value is {:.3f} %".format(2 * min(p, 1 - p) * 100))

This will print a two-sided p-value of roughly 73 % (the exact number varies slightly because of the random sampling). Since this is far above any sane significance level (say 5 %), you should not conclude anything for the concrete case.
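
If the scipy dependency is acceptable after all, a quick cross-check with scipy.stats.ttest_rel, which performs exactly this paired test, would look like this, reusing sample1 and sample2 from above:

from scipy import stats

# paired-sample t-test, equivalent to the manual computation above
t_stat, p_value = stats.ttest_rel(sample1, sample2)
print("t = {:.3f}, two-sided p = {:.4f}".format(t_stat, p_value))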

Echsecutor

Once you get your t-value, you may wonder how to interpret it as a probability -- I did. Here is a function I wrote to help with that.

It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution.

from scipy import stats

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# of getting the observed values if there is actually no correlation in the
# given direction.
#
# direction:
#  if positive, p is the probability of getting the observed result when
#     there is no positive correlation in the normally distributed full
#     populations sampled by X and Y
#  if negative, p is the probability of getting the observed result when
#     there is no negative correlation
#  if 0, p is the probability of getting your result if your hypothesis is
#     true that there is no correlation in either direction
def probabilityOfResult(X, Y, direction=0):
    n = len(X)
    if n != len(Y):
        raise ValueError("variables not same len: " + str(n) + ", and " +
                         str(len(Y)))
    if n < 6:
        raise ValueError("must have at least 6 samples, but have " + str(n))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    # halve the two-tailed p-value for a one-tailed test
    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        # the observed correlation has the hypothesized sign
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)
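
A quick usage sketch (the data here is made up purely for illustration):

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2]
# test for a positive correlation
r, p = probabilityOfResult(X, Y, direction=1)
print("r = {:.3f}, one-tailed p = {:.4f}".format(r, p))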
Joshua Richardson
  • I just wanted to note that the correlation coefficient doesn't have any interpretation as a probability, so this is pretty confused. It's just a measure of linear dependence taking values in the interval [-1,1] – Ben Allison Feb 08 '13 at 10:03
  • Correlation coefficient is clearly related to probability (look at the return values of this function): http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html The stronger the coefficient, the more likely two things are to really be correlated. You could take the correlation as fact if you sampled the complete universe, but if you have a limited sample size, it is only an indication of correlation: a probability. – Joshua Richardson Mar 11 '13 at 20:56
  • The correlation coefficient measures the extent to which one value can be predicted given that the other is known: it is the proportion of variance in one variable explained by the other. Just because it takes values between 0 and 1 (or its absolute values do) doesn't mean it's a probability. Because of this, it doesn't take binary values in the limit, as you suggest: for infinite sample sizes it still takes any value in the interval [-1,1]. Its value indicates the strength of the relationship, which could be weak regardless of sample size. – Ben Allison Mar 12 '13 at 09:57
  • I'm not saying that the correlation coefficient is a probability. The topic of this question is the statistical significance of the correlation coefficient (t-test.) I've provided three references. Here's a fourth reference for you, and it's short and clear. I hope you find the time to read this and understand the relationship between the correlation coefficient and probability before coming back here and rating my answer down yet again: http://sahs.utmb.edu/pellinore/intro_to_research/wad/correlat.htm – Joshua Richardson Mar 12 '13 at 16:58
  • OK, I see the issue: the p returned by your function is not "probability that there is no correlation". This is a frequent mistake when interpreting a hypothesis test. It is the probability of observing rho=r in a given sample given rho=0 in the population (the null hypothesis). See http://en.wikipedia.org/wiki/P-value#Misunderstandings for a comprehensive breakdown: "The p-value is not the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false. In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses" – Ben Allison Mar 13 '13 at 09:33
  • And apologies for the slightly confrontational tone: it's hard to get pleasantries into a comment of a few hundred characters! – Ben Allison Mar 13 '13 at 09:37
  • Your latest criticism has merit -- I definitely appreciate it! After reading that article, I still think p would represent probability of the null hypothesis "if there is no alternative hypothesis with a large enough a priori probability and which would explain the results more easily", i.e., in the absence of other information. I realize that the article contradicts my interpretation, but IMO it offers no compelling argument. A "frequentist" may not, but if I go to a casino, and one of the machines pays out three times in a row, you bet I'm dropping another quarter in. – Joshua Richardson Mar 13 '13 at 17:19
  • You may or may not be pleased to learn that this is one of the most common mistakes in interpreting hypothesis tests. Read this blog post: http://www.statisticalengineering.com/p-values.html, paying particular attention to the parable attributed to Carver (1978), to understand the error. Also see http://www.ma.utexas.edu/users/mks/statmistakes/misundcond.html. Most good statistics textbooks will cover this issue too! – Ben Allison Mar 14 '13 at 11:29
  • Ok, I fixed the wording. If it's still misleading, please let me know. – Joshua Richardson Apr 04 '13 at 18:06