0

I have the following data from 10 different people.

df = pd.DataFrame({'id':range(1, 11),
                   'x':[.7,-1.6,-.2,-1.2,-.1,3.4,3.7,.8,0,2]})
print(df)
   id    x
0   1  0.7
1   2 -1.6
2   3 -0.2
3   4 -1.2
4   5 -0.1
5   6  3.4
6   7  3.7
7   8  0.8
8   9  0.0
9  10  2.0

I want to calculate a 95 per cent confidence interval for the population mean of df[x].

Since the number of observations is small, the sample mean should follow the t distribution with 10 - 1 degrees of freedom. I tried the following in order to calculate a 95 per cent C.I. using scipy:

# Libraries
import numpy as np
from scipy import stats

# Number of observations
n_obs = 10

# Observed mean
m_obs = df['x'].mean()

# Observed variance (unbiased)
v_obs = df['x'].var(ddof=1) / n_obs

# Declare random variable with observed parameters
t = stats.t(df=n_obs - 1, loc=m_obs, scale=np.sqrt(v_obs))

# Calculate 95% CI
t.interval(alpha=0.95)
> (-0.5297804134938646, 2.0297804134938646) ### Correct interval

This confidence interval is correct. However, I get a completely different result when I manually calculate the interval. What is causing this?

# T such that P(t < T) = 0.975
T = t.ppf(0.975)

# Manually compute interval
(m_obs - (T * np.sqrt(v_obs)), m_obs + (T * np.sqrt(v_obs)))
> (-0.3983168630668432, 1.8983168630668432) ### Incorrect interval
Arturo Sbr
  • 5,567
  • 4
  • 38
  • 76
  • Easy StackOverflow points: The difference happens because `t` is not standardized and I am using it to get some `T` that has 97.5% of the distribution to its left. If anyone wants to post this as an answer, I'll gladly accept it. – Arturo Sbr Apr 15 '21 at 18:25

1 Answers1

0

It's been two weeks since I posted the answer in the comments but no one took credit for it, so here it goes:

The reason the two confidence intervals differ is because the value T accumulates 97.5% of the area under the probability distribution of t, and t has mean m_obs and variance v_obs. That is, it is not a standard t distribution.

The correct interval can be calculated correctly by simply making T the value that accumulates 0.975 of the probability of a standard t distribution:

# Number of observations                                      (UNCHANGED)
n_obs = 10

# Observed mean                                               (UNCHANGED)
m_obs = df['x'].mean()

# Observed variance (unbiased)                                (UNCHANGED)
v_obs = df['x'].var(ddof=1) / n_obs

# Declare *STANDARD* t distribution                           (CHANGED!!!)
t = stats.t(df=n_obs - 1, loc=0, scale=1)

# T such that P(x < T) = 0.975                              (CHANGED!!!)
T = t.ppf(0.975)

# Manually compute interval                                   (CORRECT ANSWER)
(m_obs - (T * np.sqrt(v_obs)), m_obs + (T * np.sqrt(v_obs)))
> (-0.5297804134938646, 2.0297804134938646)

This yields the correct answer.

Arturo Sbr
  • 5,567
  • 4
  • 38
  • 76