10

My data--a 196,585-record numpy array extracted from a pandas dataframe--are being placed into a single bin by matplotlib's hist function. The data were originally integers, so I tried converting them to float as well, as shown below, but they are still not being distributed among 10 bins.

Interestingly, a small sub-sample (using df.sample(0.00x)) of the integer data are successfully distributed.

Any suggestions on where I may be erring in data preparation or use of matplotlib's histogram function would be appreciated.

[Image: histogram output]

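import matplotlib.pyplot as plt

# Values for unit 'X'; df is the pandas dataframe described above
x = df[df['UNIT'] == 'X'].OPP_VALUE.values
num_bins = 10
# Histogram of the positive values as floats, with raw counts (density=False)
n, bins, patches = plt.hist(x[x > 0].astype(float), num_bins, density=False, facecolor='0.5', alpha=0.8)
plt.show()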
A. Slowey
  • Try using `log=True`. Your sample contains very few large values, which skew the distribution. You may have to think about removing them. – cel Aug 02 '16 at 17:49
  • Yup. Looks like you need to zoom all the way in. Can you print the output of `print(n); print(bins);`? – Mad Physicist Aug 02 '16 at 17:52
  • You hit the nail on the head, so much so that even log=True doesn't work: **print(bins)** [ 1.00000000e+00 3.00000000e+09 6.00000000e+09 9.00000000e+09 1.20000000e+10 1.50000000e+10 1.80000000e+10 2.10000000e+10 2.40000000e+10 2.70000000e+10 3.00000000e+10] **print(n)** [ 1.86114000e+05 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00] – A. Slowey Aug 02 '16 at 18:05
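
The printed output above shows ten equal-width bins spanning 1 to 3e10, with all but one point in the first bin. One possible way around this, sketched below, is to use log-spaced bin edges instead of equal-width ones. This is only an illustration: the data are synthetic stand-ins (an assumption, not the asker's values), and np.histogram is used in place of plt.hist so the counts can be inspected directly. The same edges array can be passed to plt.hist through its bins argument.

```python
import numpy as np

# Synthetic stand-in data (assumption): many small positive values plus one huge outlier.
rng = np.random.default_rng(0)
x = np.append(rng.integers(1, 1000, size=10_000).astype(float), 3e10)

# Ten equal-width bins: the outlier stretches the range, so the
# small values all fall into the first bin.
n_linear, edges_linear = np.histogram(x, bins=10)

# Ten log-spaced bins between the smallest and largest value
# (requires strictly positive data).
edges_log = np.geomspace(x.min(), x.max(), num=11)
n_log, _ = np.histogram(x, bins=edges_log)

print(n_linear)
print(n_log)
```

When plotting with these edges, setting a logarithmic x-axis (for example with plt.xscale('log')) keeps the bars at a comparable visual width.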

2 Answers

5

Most likely, the number of data points with x > 0.5 is very small, but you do have some outliers that force the hist function to pick the scale it does. Try removing all values > 0.5 (or 1 if you do not want to convert to float) and then plot again.
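
As a rough sketch of the idea above, the snippet below removes values above a cutoff before binning and compares the counts with and without the outlier. The data are synthetic, and the 99th-percentile cutoff is an arbitrary illustrative choice (both are assumptions); the right threshold depends on the actual data.

```python
import numpy as np

# Synthetic stand-in data (assumption): small positive values plus one extreme outlier.
rng = np.random.default_rng(1)
x = np.append(rng.integers(1, 1000, size=10_000).astype(float), 3e10)

# Hypothetical cutoff: keep everything at or below the 99th percentile.
cutoff = np.percentile(x, 99)
trimmed = x[x <= cutoff]

# With the outlier, one bin holds all the small values; without it,
# the ten bins cover only the range of the remaining data.
n_all, _ = np.histogram(x, bins=10)
n_trimmed, edges = np.histogram(trimmed, bins=10)

print(n_all)
print(n_trimmed)
```

Passing trimmed to plt.hist in place of the full array should then spread the bars across all ten bins.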

  • I'm also facing this issue. Could you explain in a little more detail? I am plotting after removing outliers using z-score and I am getting this. – Scope May 14 '21 at 16:00
-1

You should modify the number of bins, for example:

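number_of_bins = 200
# Edges run from the minimum to the 99th percentile; N bins need N + 1 edges
bin_cutoffs = np.linspace(np.percentile(x, 0), np.percentile(x, 99), number_of_bins + 1)
plt.hist(x, bins=bin_cutoffs)  # values above the last edge are left out of the plot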