
I'm trying to calculate a p-value for my metric (Spearman correlation), and I'd like to generalize the method so it works with other metrics as well (instead of relying on scipy.stats.spearmanr).

How can I generate a p-value of a point from this distribution?

Does the method apply to non-normal distributions? My data is approximately normally distributed, and would probably be more so if I sampled more than 100 points.

This related post requires µ=0, std=1: Convert Z-score (Z-value, standard score) to p-value for normal distribution in Python

from scipy import stats
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

data = np.asarray([0.027972027972027972, -0.2802197802197802, -0.21818181818181817, 0.3464285714285714, 0.15, 0.34065934065934067, -0.3216783216783217, 0.08391608391608392, -0.03496503496503497, -0.2967032967032967, 0.09090909090909091, 0.11188811188811189, 0.1181818181818182, -0.4787878787878788, -0.6923076923076923, -0.05494505494505495, 0.19090909090909092, 0.3146853146853147, -0.42727272727272725, 0.06363636363636363, 0.1978021978021978, 0.12142857142857141, 0.10303030303030303, 0.23214285714285712, -0.5804195804195805, 0.013986013986013986, 0.02727272727272727, 0.5659340659340659, 0.06363636363636363, -0.503030303030303, -0.2867132867132867, 0.07252747252747253, -0.13736263736263737, 0.21212121212121213, -0.09010989010989011, -0.2517482517482518, -0.17482517482517484, -0.3706293706293707, 0.15454545454545454, 0.01818181818181818, 0.17582417582417584, 0.3230769230769231, -0.09642857142857142, -0.5274725274725275, -0.23626373626373626, -0.2692307692307692, -0.2857142857142857, -0.19999999999999998, -0.489010989010989, -0.15454545454545454, 0.38461538461538464, 0.6, 0.37762237762237766, -0.0029411764705882353, -0.06993006993006994, -0.19999999999999998, 0.38181818181818183, 0.05454545454545455, -0.03296703296703297, 0.17272727272727273, -0.13986013986013987, -0.08241758241758242, -0.34545454545454546, 0.5252747252747253, 0.10303030303030303, 0.16783216783216784, -0.36363636363636365, -0.42857142857142855, 0.12727272727272726, -0.18181818181818182, -0.10439560439560439, -0.6083916083916084, -0.1956043956043956, 0.13846153846153847, -0.48951048951048953, -0.18881118881118883, 0.7362637362637363, -0.19090909090909092, 0.4909090909090909, 0.37142857142857144, -0.3090909090909091, -0.1098901098901099, 0.15151515151515152, -0.13636363636363635, -0.5494505494505495, 0.44755244755244755, 0.04895104895104896, -0.37142857142857144, 0.01098901098901099, 0.08131868131868132, 0.2571428571428571, -0.3076923076923077, 0.24545454545454545, 0.06043956043956044, 
0.06764705882352941, 0.02727272727272727, -0.07252747252747253, 0.21818181818181817, -0.03846153846153846, 0.48571428571428577])
query_value = -0.44155844155844154

with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.distplot(data, rug=True, color="teal", ax=ax)
    ax.set_xlabel("$x$", fontsize=15)
    ax.axvline(query_value, color="black", linestyle=":", linewidth=1.618, label="Query: %0.5f"%query_value)
    ax.legend()

# Test the null hypothesis that the data come from a normal distribution
print(stats.normaltest(data))
# Fit a normal distribution to the data (returns loc=mean, scale=std)
params = stats.norm.fit(data)
# Freeze the fitted distribution
distribution = stats.norm(*params)
distribution
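For reference, here is my current attempt at the parametric route (I'm not sure this is correct, hence the question) — standardizing the query value against the fitted normal and taking a two-sided tail probability. The sample below is a hypothetical stand-in for the 100 correlation values above:

```python
from scipy import stats
import numpy as np

# Hypothetical stand-in for the 100 permuted correlation values above
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=0.3, size=100)
query_value = -0.44155844155844154

# Fit a normal distribution and standardize the query value
mu, sigma = stats.norm.fit(data)
z = (query_value - mu) / sigma

# Two-sided tail probability under the fitted normal
p_value = 2 * stats.norm.sf(abs(z))
print(p_value)
```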


B--rian
O.rka
  • I can't see from your text, what is the point value, mean and sd of the normal distribution? Also, are you trying to estimate the density or the cumulative? – user2974951 Aug 23 '19 at 10:27
  • These are correlation values ranging from -1 to 1. I am trying to determine if my point value is statistically significant. My understanding is that I measure the number of occurrences below my value (-0.44) and above (0.44) then divide this by the total number of permutations (N=100). However, this would be a probability. Is it possible to use the scipy distribution and a point value to determine whether or not a value is signficant? – O.rka Aug 23 '19 at 18:24
  • You mentioned some new piece of information now that you did not include in the question, is this a permutation test? – user2974951 Aug 24 '19 at 06:27
  • Yes it is. Sorry I left that out. – O.rka Aug 25 '19 at 03:07

1 Answer


Based on your comments I am going to assume that these are the results from a permutation test. That is, you obtained a value from your original data set (-0.44), while all the other values were obtained by permuting your data. Now you would like to determine whether your original value is significant.

A permutation test (a resampling method) is non-parametric, so it does not rely on a normal distribution at all. In your case the permutation distribution happens to look roughly normal, but that is neither necessary nor assumed. There are different ways to estimate a p-value from the permuted distribution; the simplest option is similar to your idea.

If you performed every possible permutation you would have the exact permutation distribution, and your formula for a (two-sided) p-value is correct: p-value = #(|t| >= |t*|) / p, where t* is the value from the original data, the t are the permuted values, and p is the total number of permutations.

If you performed an incomplete (Monte Carlo) set of permutations, the formula is only slightly different: p-value = (1 + #(|t| >= |t*|)) / (1 + p). Adding one to the numerator and denominator accounts for the randomness of the sampled permutations and ensures the estimated p-value is never exactly zero.
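In code, both versions look like this (a sketch using hypothetical permuted values in place of your 100 statistics):

```python
import numpy as np

# Hypothetical permuted statistics t and observed statistic t*
rng = np.random.default_rng(42)
permuted = rng.normal(loc=0.0, scale=0.3, size=100)  # stand-in for your permuted values
observed = -0.44155844155844154                      # your original (non-permuted) value

# Count permuted values at least as extreme as the observed one
extreme = np.sum(np.abs(permuted) >= abs(observed))

# Exact version: every possible permutation was enumerated
p_exact = extreme / len(permuted)

# Monte Carlo version: only a random subset of permutations was run
p_mc = (1 + extreme) / (1 + len(permuted))
print(p_exact, p_mc)
```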

user2974951
  • So if I was doing a similar permutation test with euclidean distance as the metric to determine if a certain distance was significantly small, would it be a one-sided test where I'm only checking the proportion of the null values being lower? – O.rka Aug 26 '19 at 19:08
  • Thank you! This is really useful. So if I fit my null values to a distribution such as the one above, is there a way to calculate a p-value from the actual scipy distribution? Or is this not possible and should/can only be done using the permutation test you described? – O.rka Aug 27 '19 at 18:45
  • @O.rka I don't really understand what you are asking, but if this is about Python then I cannot help you there. – user2974951 Aug 28 '19 at 08:55