0

I was trying to simulate "Sampling Distribution of Sample Proportions" using Python. I tried with a Bernoulli Variable as in example here

The crux is that, out of large number of gumballs, we have yellow balls with true proportion of 0.6. If we take samples (of some size, say 10), take mean of that and plot, we should get a normal distribution.

I tried to do in python but I only always get uniform distribution (or flats out in the middle). I am not able to understand what am I missing.

Program:

from SDSP import create_bernoulli_population, get_frequency_df
from random import shuffle, choices
from bi_to_nor_demo import get_metrics, bare_minimal_plot
import matplotlib.pyplot as plt


N = 10000  # 10000 balls
p = 0.6    # probability of yellow ball is 0.6, and others (1-0.6)=>0.4
n_pickups = 1000       # sample size
n_experiments = 100  # I dont know what this is called 


# generate population
population = create_bernoulli_population(N,p)
theor_df = get_frequency_df(population)
theor_df

# choose sample, take mean and add to X_mean_list. Do this for n_experiments times
X_hat = []
X_mean_list = []
for each_experiment in range(n_experiments):
    X_hat = choices(population, k=n_pickups)  # this method is with replacement
    shuffle(population)
    X_mean = sum(X_hat)/len(X_hat)
    X_mean_list.append(X_mean)

# plot X_mean_list as bar graph
stats_df = get_frequency_df(X_mean_list)
fig, ax = plt.subplots(1,1, figsize=(5,5))
X = stats_df['x'].tolist()
P = stats_df['p(x)'].tolist()    
ax.bar(X, P, color="C0") 

plt.show()

Dependent functions:
bi_to_nor_demo
SDSP

Output:
enter image description here

Update: I even tried uniform distribution as below but getting similar output. Not converging to normal :(. (using below function in place of create_bernoulli_population)

def create_uniform_population(N, Y=[]):
    """
    Given the total size of population N, 
    this function generates list of those outcomes uniformly distributed
    population list
    N - Population size, eg N=10000
    p - probability of interested outcome  
    Returns the outcomes spread out in population as a list
    """
    uniform_p = 1/len(Y)
    print(uniform_p)
    total_pops = []
    for i in range(0,len(Y)):
        each_o = [i]*(int(uniform_p*N))
        total_pops += each_o
    shuffle(total_pops)    
    return total_pops
Parthiban Rajendran
  • 430
  • 1
  • 7
  • 18

3 Answers3

1

can you please share your matplotlib settings? I think you have the plot truncated, you are correct in that the sample distribution of the sample proportion on a bernoulli should be normally distributed around the population expected value ...

perhaps using something as:

plt.tight_layout()

to check if there are no graph issues

Daniel Vieira
  • 461
  • 5
  • 19
1
def plotHist(nr, N, n_):
    ''' plots the RVs'''
    x = np.zeros((N))
    sp = f.add_subplot(3, 2, n_ )

    for i in range(N):    
        for j in range(nr):
            x[i] += np.random.binomial(10, 0.6)/10 
        x[i] *= 1/nr
    plt.hist(x, 100, normed=True, color='#348ABD', label=" %d RVs"%(nr));
    plt.setp(sp.get_yticklabels(), visible=False)


N = 1000000   # number of samples taken
nr = ([1, 2, 4, 8, 16, 32])

for i in range(np.size(nr)):
    plotHist(nr[i], N, i+1)

Above is a code sample based on a general blog I wrote on CLT: https://rajeshrinet.github.io/blog/2014/central-limit-theorem/

Essentially, I am generating several random numbers (nr) from a distribution in the range (0,1) and summing them. Then I see, how do they converge as I increase the number of the random numbers.

Here is a screenshot of the code and the result.

Rajesh
  • 11
  • 3
  • your np.random.binomial will always return random value from binomial distribution? A binomial distribution always converges to normal. In my case, I am trying with bernoulli distribution and uniform. Can you please check my code if I am missing anything? – Parthiban Rajendran Aug 07 '18 at 12:07
  • Sorry for getting your question wrong! I have not redone the exercise for a Binomial for n=1, which is Bernoulli. You have a long code! I will try to find some time to look into it. – Rajesh Aug 07 '18 at 13:05
  • Also, the above can be modified by replacing ``x[i] += np.random.binomial(10, 0.6)/10`` by ``x[i] += np.random.binomial(1, 0.6)``. Then, this is Bernoulli. You can see it still tends to a Gaussian. I have also sent you a screenshot over email. – Rajesh Aug 07 '18 at 13:12
  • I highly appreciate responding my request and helping me here. looking forward for feedback. I also tried running your code with n=1 in random.binomial, and got a non normal distribution as [here](https://s22.postimg.cc/td989xawh/image.png) – Parthiban Rajendran Aug 07 '18 at 13:13
  • strange, I am getting only one plot (along with a warning), and non normal distribution as [here](https://s22.postimg.cc/3vqtpxvmp/2018-08-07_18h44_33.png) again – Parthiban Rajendran Aug 07 '18 at 13:15
  • yeah, I saw your image so wondering y mine is different. also it takes very long time to execute for 1000000 – Parthiban Rajendran Aug 07 '18 at 13:22
  • I solved the warning and single graph issue, now its producing like what you have shown, I am exploring further how it could be related. please revert once you get time to look at my code. – Parthiban Rajendran Aug 07 '18 at 13:51
  • Investigating and trying to map your code to my logic helped to find the culprit buddy, now the graphs are much better, without deviating much from my logic. I will post the updated code, Please review if conceptually ok. – Parthiban Rajendran Aug 07 '18 at 15:57
0

Solution:
I guess I have arrived at the solution. By reverse engineering Rajesh's approach and taking hint from Daniel if graph could be an issue, finally I have figured out the culprit: default bar graph width being 0.8 is too wide to show my graph as flattened on top. Below is modified code and output.

from SDSP import create_bernoulli_population, get_frequency_df
from random import shuffle, choices
from bi_to_nor_demo import get_metrics, bare_minimal_plot
import matplotlib.pyplot as plt

N = 10000  # 10000 balls
p = 0.6    # probability of yellow ball is 0.6, and others (1-0.6)=>0.4
n_pickups = 10       # sample size
n_experiments = 2000  # I dont know what this is called 


# THEORETICAL PDF
# generate population and calculate theoretical bernoulli pdf
population = create_bernoulli_population(N,p)
theor_df = get_frequency_df(population)


# STATISTICAL PDF
# choose sample, take mean and add to X_mean_list. Do this for n_experiments times. 
X_hat = []
X_mean_list = []
for each_experiment in range(n_experiments):
    X_hat = choices(population, k=n_pickups)  # choose, say 10 samples from population (with replacement)
    X_mean = sum(X_hat)/len(X_hat)
    X_mean_list.append(X_mean)
stats_df = get_frequency_df(X_mean_list)


# plot both theoretical and statistical outcomes
fig, (ax1,ax2) = plt.subplots(2,1, figsize=(5,10))
from SDSP import plot_pdf
mu,var,sigma = get_metrics(theor_df)
plot_pdf(theor_df, ax1, mu, sigma, p, title='True Population Parameters')
mu,var,sigma = get_metrics(stats_df)
plot_pdf(stats_df, ax2, mu, sigma, p=mu, bar_width=round(0.5/n_pickups,3),title='Sampling Distribution of\n a Sample Proportion')
plt.tight_layout()
plt.show()

Output:
output_solved

Parthiban Rajendran
  • 430
  • 1
  • 7
  • 18