0

I am new to python and in the following code, I would like to plot a bell curve to show how the data follows a norm distribution. How would I go about it? Also, can anyone answer why when showing the hist, I have values (x-axis) greater than 100? I would assume by defining the Randels to 100, it would not show anything above it. If I am not mistaken, the x-axis represents what "floor" I am in and the y-axis represents how many observations matched that floor. By the way, this is a datacamp project.

"""
Let's say I roll a dice to determine if I go up or down a step in a building with
100 floors (1 step = 1 floor). If the dice is less than 2, I go down a step. If 
the dice is less than or equal to 5, I go up a step, and if the dice is equal to 6,
I go up x steps based on a random integer generator between 1 and 6. What is the probability
I will be higher than floor 60?
"""

import numpy as np
import matplotlib.pyplot as plt

# Set the seed
np.random.seed(123)

# Simulate random walk 
all_walks = []
for i in range(1000) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.001 : # There's a 0.1% chance I fall and have to start at 0
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)

# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))

# Select last row from np_aw_t: ends
ends = np_aw_t[-1,:]

# Plot histogram of ends, display plot
plt.hist(ends,bins=10,edgecolor='k',alpha=0.65)
plt.style.use('fivethirtyeight')
plt.xlabel("Floor")
plt.ylabel("# of times in floor")
plt.show()
  • [Fitting a Normal distribution to 1D data](https://stackoverflow.com/questions/20011122/fitting-a-normal-distribution-to-1d-data) I hope you find this answer helpful as well. – r-beginners May 22 '20 at 07:46

1 Answers1

0

You can use scipy.stats.norm to get a normal distribution. Documentation for it here. To fit any function to a data set you can use scipy.optimize.curve_fit(), documentation for that here. My suggestion would be something like the following:

import scipy.stats as ss
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt

#Making a figure with two y-axis (one for the hist, one for the pdf)
#An alternative would be to multiply the pdf by the sum of counts if you just want to show the fit.
fig, ax = plt.subplots(1,1)
twinx = ax.twinx()

rands = ss.norm.rvs(loc = 1, scale = 1, size = 1000)

#hist returns the bins and the value of each bin, plot to the y-axis ax
hist = ax.hist(rands)
vals, bins = hist[0], hist[1]

#calculating the center of each bin
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]

#finding the best fit coefficients, note vals/sum(vals) to get the probability in each bin instead of the count
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )

#loc and scale are mean and standard deviation i believe
loc, scale = coeff

#x-values to plot the normal distribution curve
x = np.linspace(min(bins), max(bins), 100)

#Evaluating the pdf with the best fit mean and std
p = ss.norm.pdf(x, loc = loc, scale = scale)

#plot the pdf to the other axis and show
twinx.plot(x,p)
plt.show()

There are likely more elegant ways to do this, but if you are new to python and are going to use it for calculations and such, getting to know curve_fit and scipy.stats is recomended. I'm not sure I understand whan you mean by "defining the Randels", hist will plot a "standard" histogram with bins on the x-axis and the count in each bin on the y-axis. When using these counts to fit a pdf we can just divide all the counts by the total number of counts.

Hope that helps, just ask if anything is unclear :)

Edit: compact version

vals, bins,_ = ax.hist(my_histogram_data)
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )
x = np.linspace(min(bins), max(bins), 100)
p = ss.norm.pdf(x, loc = coeff[0], scale = coeff[1])
#p is now the fitted normal distribution
  • I tried your code and it indeed plots a norm dist over the histogram. However, is there a way to just fit a norm dist over my existing histogram? I think I've seen it before in R. How does one fit a nom dist based on a given histogram? Also, sorry about the confusion, what I meant is that if you run my code, you will see the x-axis going to 140, but I limited the range to range(100), so why do I see values above it? – Rick Villanueva May 22 '20 at 20:10
  • This is the way i would fit a normal pdf to a histogram (just replace the list "rand" with your data). R is optimized for statistics, so you can't expect python to be as perfect for that kind of job (although there is probably a library for it that i just havent found yet). The part of the code that does the fitting is just 3 lines, edited now to a compact version as well. The x-axis goes up to the highest value in ends unless you use `plt.xlim()` to cut it off, so the explanation for that has to do with calculating the walks. – Vegard Gjeldvik Jervell May 22 '20 at 20:28
  • I ran it with your code now, and noticed that you will get a very poor fit with only 10 bins (as you only get 10 data points), so maybe consider running `hist()` with more bins when you want to generate points for your fit (≈100) gave me a good fit. Then run it one more time to generate your plot. – Vegard Gjeldvik Jervell May 22 '20 at 20:42