
The problem is with the graph produced by scipy.stats.probplot(). Samples from a normal distribution don't produce a straight line as expected.

I am trying to normalize some data using graphs as guidance.

However, after some strange results showing that zscore and log transformations had no effect, I started looking for what was wrong.

So, I built a graph using synthetic values that should follow a normal distribution, and the resulting graph looks very odd.

Here are the steps to reproduce the array and the graph:

import math
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
norm = stats.norm.pdf(x, mu, sigma)

plt.plot(x, norm)
plt.show()
_ = stats.probplot(norm, plot=plt, sparams=(0, 1))
plt.show()

Distribution curve:

[Figure: distribution curve]

Probability plot:

[Figure: probability plot]

  • I don't see the problem here. The distribution curve looks like a standard normal probability density function and the probplot looks like a standard normal cumulative distribution function. Have I misunderstood your question? – Michael Ruth Nov 23 '22 at 18:22
  • Look at this notebook: https://www.kaggle.com/code/serigne/stacked-regressions-top-4-on-leaderboard There, the author plotted the probabilities of the normalized 'SalesPrice', and they formed a straight line, not a curve like my data, which also has a normal distribution. – Márcio A. Freitas Jr Nov 25 '22 at 12:24
  • Your synthesized data aren't normally distributed. `numpy.linspace()` produces evenly spaced values, and `stats.norm.pdf()` only evaluates the density on that grid, so the result is not a sample from a normal distribution. You can visualize this by adding `seaborn.distplot(x, fit=scipy.stats.norm)`. Try synthesizing your data with `numpy.random.normal()`, which exists for exactly this purpose. When I use `numpy.random.normal()` and plug it into the code in the kaggle link, I get figures which look like those in the kaggle link. – Michael Ruth Nov 26 '22 at 17:14

1 Answer

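Your synthesized data aren't normally distributed. numpy.linspace() produces evenly spaced values, and stats.norm.pdf() only evaluates the density on that grid, so the array passed to probplot() holds PDF values rather than samples drawn from a normal distribution. You can visualize this by passing that array to seaborn.distplot() with fit=scipy.stats.norm (note that distplot has been deprecated since seaborn 0.11).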

import math

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns


mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)  # evenly spaced grid, not random samples
y = stats.norm.pdf(x, mu, sigma)  # density evaluated on the grid, not draws from it

sns.distplot(y, fit=stats.norm)
fig = plt.figure()
res = stats.probplot(y, plot=plt, sparams=(0, 1))
plt.show()
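
A quick numeric check makes the same point without a plot. This is only a sketch, assuming the same grid as in the question; the Shapiro-Wilk test (scipy.stats.shapiro) is my own addition for illustration and is not part of the original code:

```python
import numpy as np
from scipy import stats

# Same construction as in the question: density values on an evenly spaced grid.
x = np.linspace(-3, 3, 100)
y = stats.norm.pdf(x, 0, 1)

# PDF values are all positive and capped at 1/sqrt(2*pi) (about 0.399),
# so they cannot look like draws from a standard normal distribution.
y_min, y_max = y.min(), y.max()
print(f"min={y_min:.4f}, max={y_max:.4f}")

# Shapiro-Wilk normality test (added here for illustration): a small p-value
# suggests the data are unlikely to come from a normal distribution.
statistic, p_value = stats.shapiro(y)
print(f"Shapiro-Wilk p-value: {p_value:.3g}")
```

The range alone is telling: a standard normal sample would be centered on 0 with negative values, while these values all sit between 0 and about 0.4.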

Try synthesizing your data with numpy.random.normal(). This will give you normally distributed data.

import math

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns


mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.random.normal(loc=mu, scale=sigma, size=100)  # 100 random draws from N(mu, sigma^2)

sns.distplot(x, fit=stats.norm)
fig = plt.figure()
res = stats.probplot(x, plot=plt, sparams=(0, 1))
plt.show()
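
If you want a number rather than a picture, probplot() also returns the correlation coefficient r of its least-squares fit, and r close to 1 means the points lie near a straight line. The sketch below compares three arrays under my own assumptions: the helper name probplot_r, the seed, and the sample size are illustrative choices, and the stats.norm.ppf() variant is just one guess at what "deterministic normal-looking values" might mean in the question.

```python
import numpy as np
from scipy import stats


def probplot_r(data):
    """Return the correlation coefficient r of the probability-plot fit."""
    (_, _), (_, _, r) = stats.probplot(data, dist="norm")
    return r


# Random draws from a standard normal distribution (seed chosen arbitrarily).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0, scale=1, size=1000)

# What the question plotted: PDF values on an evenly spaced grid.
grid = np.linspace(-3, 3, 100)
pdf_values = stats.norm.pdf(grid, 0, 1)

# Deterministic alternative: normal quantiles at evenly spaced probabilities.
probs = (np.arange(1, 101) - 0.5) / 100
quantiles = stats.norm.ppf(probs)

r_samples = probplot_r(samples)
r_pdf = probplot_r(pdf_values)
r_quantiles = probplot_r(quantiles)

print(f"random samples: r = {r_samples:.4f}")
print(f"pdf values:     r = {r_pdf:.4f}")
print(f"ppf quantiles:  r = {r_quantiles:.4f}")
```

The PDF values should give a visibly lower r than either the random samples or the quantiles, which matches the curved probability plot in the question.
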
Michael Ruth