
I am trying to generate a simple histogram with 1% bins from a normal distribution, but I am getting incredibly small bin counts. Where am I going wrong in my use of np.histogram?

Here is the basic implementation:

import streamlit as st
import math
import pandas as pd
import numpy as np
from numpy.random import normal
import random
import matplotlib.pyplot as plt
import plotly.graph_objects as go

mean = 600000
uncertainty = 5.02
st_dev = mean * uncertainty/100

year1_dist = normal(mean, st_dev, 10000)

bin_size = mean * 0.01
nbins = math.ceil((year1_dist.max() - year1_dist.min()) / bin_size)
hist, bin_edges = np.histogram(year1_dist, bins=nbins, density=True)

The values stored in hist are very small (they sum to something like 0.00017). I have also tried plotting the histogram with plotly using the following implementation, which gives the same result (very low frequency of occurrence):

fig = go.Figure(data=[go.Histogram(x=year1_dist, nbinsx=nbins, name='Histogram')])

Plotly Histogram showing very very small x-axis values

Ultimately, I would like to overlay a CDF on the histogram so that it resembles something like the figure below. I know the histogram frequency will need some normalization, and I need to rework my inputs a bit so that the mean is at zero.

Desired final depiction

I have the CDF generated and plotted as expected, and I am implementing the tool in streamlit. Here is the plotting section of my code, which shows the CDF and the histogram (albeit with the histogram values being very low):

    bin_size = mean * 0.01
    nbins = math.ceil((year1_dist.max() - year1_dist.min()) / bin_size)
    hist, bin_edges = np.histogram(year1_dist, bins=nbins, density=True)
    # density * bin width = probability per bin; the running sum is the CDF
    cdf = np.cumsum(hist * np.diff(bin_edges))
    fig = go.Figure(data=[
        go.Histogram(x=year1_dist, nbinsx=nbins, name='Histogram'),
        # cdf has one value per bin, so plot it against the right-hand bin edges
        go.Scatter(x=bin_edges[1:], y=cdf, name='CDF')
    ])
    st.plotly_chart(fig, use_container_width=True)
C. Nielsen
  • All of the data populated into the histogram is generated from the code above (it's just a normal distribution). The imports are minimal: `import streamlit as st`, `import math`, `import pandas as pd`, `import numpy as np`, `from numpy.random import normal`, `import random`, `import matplotlib.pyplot as plt`, `import plotly.graph_objects as go`. – C. Nielsen Apr 26 '22 at 19:09

1 Answer


From the docs:

density : bool, optional
If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1.

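I suspect you did np.sum(hist) and got your 0.00017. With density=True, each value in hist is a probability density (probability per unit of x), not a count, so the values are small when the bins are thousands of units wide. np.sum(hist)*bin_size should give you the expected value of approximately 1.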
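As a rough illustration of the point above, here is a minimal sketch that reuses the numbers from the question. The seeded `default_rng` generator and the names `density`, `counts` and `widths` are my own choices for the example, not part of the original code.

```python
import math
import numpy as np

# Seeded generator (an assumption, only so the sketch is reproducible)
rng = np.random.default_rng(0)

mean = 600000
st_dev = mean * 5.02 / 100
year1_dist = rng.normal(mean, st_dev, 10000)

bin_size = mean * 0.01
nbins = math.ceil((year1_dist.max() - year1_dist.min()) / bin_size)

# density=True returns probability density per unit of x
density, bin_edges = np.histogram(year1_dist, bins=nbins, density=True)
# density=False (the default) returns the number of samples per bin
counts, _ = np.histogram(year1_dist, bins=nbins)

widths = np.diff(bin_edges)          # actual bin widths chosen by numpy
print(density.sum())                 # small: roughly 1 / bin width
print((density * widths).sum())      # integrates to 1, up to rounding
print(counts.sum())                  # every sample lands in some bin
```

So the tiny numbers are expected: they are densities, and multiplying each one by its bin width recovers the share of samples in that bin.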

FlyingTeller
  • Because the variable `bin_size` is used to calculate `nbins`, and that calculation uses `math.ceil`, the actual bin size is probably not exactly `bin_size`. A better expression for the quantity that should be 1 is `np.sum(hist)*(bin_edges[1] - bin_edges[0])`. – Warren Weckesser Apr 26 '22 at 16:08
  • In either case, the x-axis for the histogram is not frequency, which is what I would expect from a histogram (i.e. establish bins, then count the number of occurrences in each bin). What am I missing in my implementation to get that result? – C. Nielsen Apr 26 '22 at 19:23
  • @C.Nielsen was x-axis a typo? You probably want the y-axis to be in percent, i.e. how many percent of values are in a specific bin. Is that correct? – FlyingTeller Apr 27 '22 at 14:03
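
For the per-bin frequency discussed in these comments, one possible approach is to leave `density` at its default and normalize the counts yourself. The sketch below also builds explicit bin edges so that every bin is exactly `bin_size` wide. The seeded generator and the names `start`, `stop`, `percent` and `cdf` are assumptions made for illustration.

```python
import numpy as np

# Seeded generator (an assumption, only so the sketch is reproducible)
rng = np.random.default_rng(0)

mean = 600000
st_dev = mean * 5.02 / 100
year1_dist = rng.normal(mean, st_dev, 10000)

bin_size = mean * 0.01

# Explicit edges on multiples of bin_size that cover the whole sample
start = np.floor(year1_dist.min() / bin_size) * bin_size
stop = np.ceil(year1_dist.max() / bin_size) * bin_size
n_edges = int(round((stop - start) / bin_size)) + 1
bin_edges = start + bin_size * np.arange(n_edges)

# Raw counts per bin, then the share of samples per bin in percent
counts, _ = np.histogram(year1_dist, bins=bin_edges)
percent = 100 * counts / counts.sum()

# Empirical CDF evaluated at the right-hand edge of each bin
cdf = np.cumsum(counts) / counts.sum()
```

These arrays could then be plotted with something like `go.Bar(x=bin_edges[:-1], y=percent)` and `go.Scatter(x=bin_edges[1:], y=cdf)`. Since the two series have different scales (0 to 100 versus 0 to 1), a secondary y-axis or a rescaled CDF is probably needed for a readable overlay.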