
A homework question asked me to calculate a confidence interval for a mean. When I did it with the traditional method and with numpy.percentile(), I got different answers.

I think I may be misunderstanding how or when to use np.percentile(). My two questions are: 1. Am I using it wrong (wrong inputs, etc.)? 2. Am I using it in the wrong place, i.e. should it be used for bootstrap CIs rather than conventional methods?

I've calculated the CI with the traditional formula and with np.percentile():


import math
import numpy as np
import scipy.stats as ss

price = np.random.normal(11427, 5845, 30)
# 11427 = mean of original vector
# 5845 = std of original vector
print(price)

[14209.99205723 7793.06283131 10403.87407888 10910.59681669 14427.87437741 4426.8122023 13890.22030853 5652.39284669 22436.9686157 9591.28194843 15543.24262609 11951.15170839 16242.64433138 3673.40741792 18962.90840397 11320.92073514 12984.61905211 8716.97883291 15539.80873528 19324.24734807 12507.9268783 11226.36772026 8869.27092532 9117.52393498 11786.21064418 11273.61893921 17093.20022578 10163.75037277 13962.10004709 17094.70579814]

x_bar = np.mean(price) # mean of vector
s = np.std(price) # std of vector (population std, ddof=0, by default)
n = len(price) # number of obs
z = 1.96 # for a 95% CI

lower = x_bar - (z * (s/math.sqrt(n)))
upper = x_bar + (z * (s/math.sqrt(n)))
med = np.median(price)

print(lower, med, upper)

10838.458908888499 11868.68117628698 13901.386475143861

np.percentile(price, [2.5, 50, 97.5])

[ 4219.6258866 11868.68117629 20180.24569667]

ss.scoreatpercentile(price, [2.5, 50, 97.5])

[ 4219.6258866 11868.68117629 20180.24569667]

I would expect the lower, med and upper to equal the output of np.percentile().

While the median values are the same, the upper and lower values are far apart from each other.

Moreover, scipy.stats.scoreatpercentile gives the same output as numpy.percentile.

Any thoughts?

Thanks!

Edited to show the price vector.

SherbertTheCat
  • could you please provide the array `price`? – kmario23 Apr 25 '19 at 21:21
  • @kmario23 I edited it to 'show' the price array. It was a column from a DF, but I just made a random normal vector with its parameters. The error is still there and still quite large. Any help would be great! – SherbertTheCat Apr 25 '19 at 21:57
  • You will get a much better explanation of confidence interval vs percentile than I can give over at https://stats.stackexchange.com/ – danielR9 Apr 25 '19 at 22:37

1 Answer


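A confidence interval and a percentile are not the same thing, and the formulas for the two are very different. A confidence interval for the mean describes how uncertain your estimate of the mean is; percentiles describe how the data themselves are spread out.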
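To make the distinction concrete, here is a minimal sketch that computes both intervals on one sample and counts how many observations fall inside each. The seed, the `ddof=1` choice, and the helper `fraction_inside` are my own assumptions for illustration, not part of the question's code.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
price = rng.normal(11427, 5845, 30)  # same parameters as in the question

n = len(price)
x_bar = price.mean()
s = price.std(ddof=1)  # sample standard deviation (assumption: ddof=1)
z = 1.96  # normal critical value for 95%

# 95% confidence interval for the mean (normal approximation)
ci_low = x_bar - z * s / np.sqrt(n)
ci_high = x_bar + z * s / np.sqrt(n)

# Interval spanning the central 95% of the observations themselves
p_low, p_high = np.percentile(price, [2.5, 97.5])


def fraction_inside(data, low, high):
    """Share of observations that fall within [low, high]."""
    return np.mean((data >= low) & (data <= high))


print("CI for the mean:    ", ci_low, ci_high, fraction_inside(price, ci_low, ci_high))
print("percentile interval:", p_low, p_high, fraction_inside(price, p_low, p_high))
```

The percentile interval is built to contain most of the data points, while the interval for the mean only has to bracket the mean, so it should come out much narrower.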

The number of samples you have is going to affect your confidence interval, but won't change (much) the percentiles.

e.g.

price = np.random.normal(0, 1, 10000)
print(np.percentile(price, [2.5, 50, 97.5]))

gives

[-1.97681778  0.01808908  1.93659551]

and

price = np.random.normal(0, 1, 100000000)
print (np.percentile(price, [2.5, 50, 97.5]))

gives pretty much the same:

[-1.96012643  9.82108813e-05  1.96030460]

With your CI calculation code, however, massively increasing the number of samples shrinks the confidence interval, because you are now 95% confident that the mean of the distribution lies within a smaller range.

Using two price arrays drawn from the same distribution (mean=0, sd=1), one with 10 samples and one with 10,000 samples, your code prints the following (lower, median, upper):

-0.5051688819759096 0.17504324224822834 0.744716862363091 # 10 samples
-0.02645090158517636 -0.006759616493022626 0.012353106820212557 # 10000 samples

As you can see, the CI is much narrower with more samples, as you would expect given that the formula divides s by sqrt(n).
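On the bootstrap part of the question: np.percentile() can give a confidence interval for the mean if you apply it to the means of bootstrap resamples rather than to the raw data. The following is only a rough sketch of that percentile-bootstrap idea; the seed, the number of resamples `n_boot`, and the use of `ddof=1` in the comparison are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
price = rng.normal(11427, 5845, 30)  # same parameters as in the question

n_boot = 10_000  # number of bootstrap resamples (arbitrary choice)

# Resample the data with replacement and record the mean of each resample
idx = rng.integers(0, len(price), size=(n_boot, len(price)))
boot_means = price[idx].mean(axis=1)

# Percentiles of the bootstrap means, not of the raw data
boot_low, boot_high = np.percentile(boot_means, [2.5, 97.5])

# Normal-approximation interval from the formula, for comparison
se = price.std(ddof=1) / np.sqrt(len(price))
formula_low = price.mean() - 1.96 * se
formula_high = price.mean() + 1.96 * se

print("bootstrap percentile CI:", boot_low, boot_high)
print("formula CI:             ", formula_low, formula_high)
```

The two intervals should land close to each other, whereas np.percentile(price, [2.5, 97.5]) on the raw data answers a different question.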

danielR9