I'm using matplotlib to draw a probability density histogram and need the `range` argument because of how the graph looks. There are high peaks at the start and at the end, and the density at these peaks is so much higher that the rest of the graph can't be seen, so I use `range` to 'zoom' in. But when `range` is used, the probability density only considers the data within the range.

Is there a way to keep using `range` but have the probability density calculated from all the data, not only the data in the given range?

Thanks in advance!

Edit: I'm plotting the pdf of packet sizes for a data set. The graph has peaks in the lower region (~100 bytes) and in the upper region (~1450 bytes). To show the distribution in the middle of the data set, I use `range` to zoom in on different areas, which gives better detail for the distribution.

ax.hist(x=list_of_pkt_sizes,bins=25,density=True,range=[500,1000])

This is an example of the code used to plot one of the zoomed-in areas. As said above, it now only shows the distribution for the given range, normalised within that range. I want the density relative to the overall distribution.
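
To make the effect concrete, here is a small sketch with synthetic data. The `sizes` array is made up for illustration and is not the real capture. It calls `np.histogram` with the same arguments as the `ax.hist` call above and compares the area under the zoomed histogram with the share of values that actually fall inside the window.

```python
import numpy as np

# Synthetic packet sizes (an assumption for illustration): peaks near
# 100 and 1450 bytes plus a thin uniform spread in between.
rng = np.random.default_rng(0)
sizes = np.concatenate([
    rng.normal(100, 10, 5000),
    rng.normal(1450, 10, 5000),
    rng.uniform(200, 1400, 1000),
])

# Same binning arguments as the ax.hist call above
heights, edges = np.histogram(sizes, bins=25, range=(500, 1000), density=True)
widths = np.diff(edges)

# Area under the zoomed histogram
area_in_window = float(np.sum(heights * widths))
# Share of all packets that actually lie inside the window
fraction_in_window = float(np.mean((sizes >= 500) & (sizes <= 1000)))

print(area_in_window, fraction_in_window)
```

The area comes out as 1 even though only a small share of the values lies in the window, which is why the zoomed bars overstate the density relative to the full data set.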

CNAP
  • Please provide the code used to generate your PDF. It seems to me that you should be able to provide a restricted sequence of `bins=` instead of using `range=`, but it's hard to be sure without data to experiment with. Please refer to [How to ask](https://stackoverflow.com/help/how-to-ask) – Diziet Asahi Sep 17 '18 at 10:08
  • Added code snippet plus some more clarification. See edit – CNAP Sep 17 '18 at 11:06
  • Is the problem simply with the *visualization* of the histogram? If so, could you just change the x-axis limits (`plt.xlim(500,1000)`) to only show the region that you are interested in? – Diziet Asahi Sep 17 '18 at 12:20
  • That's a good point @DizietAsahi. The only issue is choosing the appropriate number of bins – Seth Nabarro Sep 17 '18 at 12:25

2 Answers

Not the most elegant solution, but you could quite easily normalise manually:

import numpy as np

# Convert list to numpy array for convenience
pkt_arr = np.array(list_of_pkt_sizes)

# Set range variables
min_range, max_range = 500, 1000

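# Filter out elements not in range to new array (bounds inclusive)
pkt_arr_in_range = pkt_arr[(pkt_arr >= min_range) & (pkt_arr <= max_range)]

# Get normalisers - bin size and total number of elements
num_elem_norm = pkt_arr.shape[0]
# np.histogram takes the data positionally (it has no x= keyword);
# range= pins the bin edges to the zoom window
counts, bins = np.histogram(pkt_arr_in_range, bins=25, range=(min_range, max_range))
bin_width = bins[1] - bins[0]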

# Get x coordinates of LHS of bins
xs = bins[:-1]

# Normalise counts (prob density per unit of input)
counts_norm = counts / (num_elem_norm * bin_width)

# Use bar chart
ax.bar(xs, counts_norm, width=bin_width, align='edge')
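
The same manual normalisation can be wrapped in a reusable helper. This is only a sketch: the names `windowed_density` and `plot_windowed_density` are hypothetical, and `ax` is assumed to be an existing matplotlib Axes.

```python
import numpy as np

def windowed_density(data, lo, hi, bins=25):
    """Histogram of [lo, hi], normalised by the size of the FULL data set."""
    data = np.asarray(data, dtype=float)
    # range= makes np.histogram ignore values outside [lo, hi] and
    # places the outer bin edges exactly on the window
    counts, edges = np.histogram(data, bins=bins, range=(lo, hi))
    widths = np.diff(edges)
    # Divide by the total number of elements, not the number in the window
    heights = counts / (data.size * widths)
    return heights, edges

def plot_windowed_density(ax, data, lo, hi, bins=25):
    """Draw the windowed density on an existing Axes (hypothetical helper)."""
    heights, edges = windowed_density(data, lo, hi, bins)
    ax.bar(edges[:-1], heights, width=np.diff(edges), align='edge')
    ax.set_xlim(lo, hi)
    return heights, edges
```

With this normalisation the total bar area equals the fraction of all values that fall inside the window, rather than 1.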

UPDATE: @DizietAsahi makes a better suggestion in their comment:

min_range, max_range = 500, 1000
min_all, max_all = min(list_of_pkt_sizes), max(list_of_pkt_sizes)
range_ratio = (max_all - min_all) / (max_range - min_range)
ax.hist(list_of_pkt_sizes, bins=int(round(25 * range_ratio)), density=True)
plt.xlim(min_range, max_range)
Seth Nabarro

Here is how I would tackle the problem. I've generated a fake distribution (`l1` below) with large numbers of low and high values, as per your description:

plt.figure()
plt.hist(l1, density=True, bins=25)

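I use the numpy.histogram function to obtain the density distribution. Notice that I use a custom bins= argument: I request one bin from 0 to 500, 25 bins between 500 and 1000, and one bin between 1000 and 2000. Because the bins together cover the whole data set, the density in the 25 middle bins is normalised against all of the data, not just the zoomed region.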

p,b = np.histogram(l1, density=True, bins=[0]+list(np.linspace(500,1000,25+1))+[2000])

Finally, I use matplotlib's bar() function to plot the resulting histogram, but I simply omit the first and last bins:

plt.figure()
plt.bar(x=b[1:-2], height=p[1:-1], width=20, align='edge')
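
For reference, the whole approach can be sketched end to end as below. The data is synthetic and only a stand-in for the fake `l1`, which is not shown in the answer, and `zoomed_density` is a hypothetical helper name. It uses the data minimum and maximum as the outer edges instead of the hard-coded 0 and 2000, which assumes the data extends beyond the window on both sides.

```python
import numpy as np

def zoomed_density(data, lo, hi, bins=25):
    """Density over the full data set, returned only for the bins in [lo, hi].

    Assumes data.min() < lo and data.max() > hi so the edges stay increasing.
    """
    data = np.asarray(data, dtype=float)
    inner = np.linspace(lo, hi, bins + 1)
    # One catch-all bin on each side so every value counts in the normalisation
    edges = np.concatenate([[data.min()], inner, [data.max()]])
    p, b = np.histogram(data, bins=edges, density=True)
    # Drop the two catch-all bins; keep the inner heights and their edges
    return p[1:-1], b[1:-1]

# Synthetic stand-in for l1 (an assumption): two tall peaks and a thin middle
rng = np.random.default_rng(1)
l1 = np.concatenate([
    rng.normal(100, 10, 5000),
    rng.normal(1450, 10, 5000),
    rng.uniform(200, 1400, 1000),
])

heights, edges = zoomed_density(l1, 500, 1000)
# Area of the zoomed bars: the share of all values inside the window
area = float(np.sum(heights * np.diff(edges)))
```

The result can then be drawn with `plt.bar(edges[:-1], heights, width=np.diff(edges), align='edge')`, which takes the bar widths from the edges instead of hard-coding 20.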

Diziet Asahi
  • Thank you, this looks like a solution to my problem. I have a lot on my plate, so it might take some days before I can test it, but I will make sure to mark it as the solution if it works! – CNAP Sep 18 '18 at 08:02