I am using matplotlib's hist2d function to make a 2D histogram of my data, but I am having trouble interpreting the result.

Here is the plot I have:

[image: 2D histogram of X and Y]

This was created using the line:

hist = plt.hist2d(X, Y, (160,160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))

This returns a 2D array of shape (160, 160), as well as the bin edges and the plotted image.

In the plot there are bins with a high frequency of values (the yellow bins). I would like to take the results of this histogram and filter out the bins with low values, keeping the high ones. I would expect there to be 160*160 values, but I can only find 160 X and 160 Y values.

What I would like to do is essentially filter out the more dense data from the less dense data. If this means representing the data as a single value (a bin), then that is ok.

Am I misinterpreting the function, or am I not accessing the results correctly? I have also tried with scipy, but the results seem to be in the same or a similar format.

GeoMonkey
  • Why would you expect `160 x 160` values? Isn't that all of your histogram? – Quang Hoang Jun 02 '20 at 00:43
  • @QuangHoang perhaps I misunderstand, but shouldn't there be 160 bins in the x-axis and then 160 bins in the Y axis, for each of the X bins? So every Xbin should have 160 Ybins? – GeoMonkey Jun 02 '20 at 00:46
  • Yes, that's your original histogram, i.e. the picture you plotted. But didn't you want to filter just the high-density bins? – Quang Hoang Jun 02 '20 at 00:56
  • Yes, I'd like the histogram as arrays so that I can filter the results after plotting. But I can only find 160 X and 160 Y bins. Shouldn't the full histogram have 25,600? – GeoMonkey Jun 02 '20 at 00:58
  • I'm not sure I follow. You stated clearly that you get a `160 x 160` array. That is your histogram, is it not? The function returns only the edges of `x` and `y`, which are 161 each. You can take the cross product of the edges to get the 2D bins, if that's what you're asking. – Quang Hoang Jun 02 '20 at 01:10
  • Yes, I think I may have confused myself. I'll put it this way: what I would like is the frequency value for any bin in the histogram. For example, how would I get the frequency of the bin that is 100 bins in the x direction and 100 bins in the y direction? – GeoMonkey Jun 02 '20 at 01:23

2 Answers

You need the Seaborn package.

You mentioned

I would like to be able to get the results of this histogram and filter out the bins that have low values, preserving the high bins.

You should definitely be using one of these:

  1. seaborn.jointplot(..., kind='hex'): shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets.
  2. seaborn.jointplot(..., kind='kde'): uses kernel density estimation to visualize a bivariate distribution. This is the one I recommend.

Example 'kde'

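Use the number of contour levels (levels, called n_levels in older Seaborn releases) and the thresh cutoff (which replaces shade_lowest=False in recent releases) to leave out the low-density regions.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Random sample standing in for the real data
x, y = np.random.randn(2, 300)

plt.figure(figsize=(6, 5))
# Filled KDE contours; density below `thresh` is left unshaded
sns.kdeplot(x=x, y=y, zorder=0, levels=6, fill=True, cbar=True,
            thresh=0.05, cmap='viridis')
plt.show()
```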

[image: resulting KDE plot]

imbr
Not sure if this is what you wanted.

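The hist2d docs specify that the function returns a tuple of four items, where the first item h is the 2D array of bin counts (the values the plot draws as a heatmap).

This h has the shape given by bins, here (160, 160), so h[i, j] is the count for the i-th x bin and the j-th y bin.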
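
To make the layout of the returned values concrete, here is a minimal sketch. It assumes synthetic random data in place of the asker's X and Y, and it uses np.histogram2d, which plt.hist2d calls internally to compute the counts, so nothing is drawn:

```python
import numpy as np

# Synthetic stand-in for the real X and Y (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
Y = rng.normal(size=10_000)

# Same counts and edges that plt.hist2d would return, without plotting
h, xedges, yedges = np.histogram2d(X, Y, bins=(160, 160))

print(h.shape)                     # one count per (x bin, y bin) pair
print(xedges.shape, yedges.shape)  # one more edge than bins on each axis

# The 100th bin along x and the 100th bin along y (zero-based index 99)
i, j = 99, 99
count = h[i, j]
x_lo, x_hi = xedges[i], xedges[i + 1]
y_lo, y_hi = yedges[j], yedges[j + 1]
print(f"bin ({i}, {j}) covers x in [{x_lo:.3f}, {x_hi:.3f}), "
      f"y in [{y_lo:.3f}, {y_hi:.3f}) and holds {int(count)} points")
```

So the 160*160 = 25,600 frequencies all live in h, while the two edge arrays only describe where the bin boundaries sit along each axis.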

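You can capture the output (it will still plot) and use argwhere to find the bin indices where the count exceeds, say, the 90th percentile:

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

h, xedges, yedges, img = plt.hist2d(X, Y, bins=(160, 160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))
# (i, j) indices of the bins whose count exceeds the 90th percentile
print(list(np.argwhere(h > np.percentile(h, 90))))
```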
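
If the goal is to separate the dense data from the sparse data, one possible approach is to look up which bin each original point falls in and keep only the points in dense bins. The sketch below is only an illustration: the data is synthetic, the cutoff `threshold = 5` is a hypothetical value, and np.histogram2d stands in for plt.hist2d (it yields the same counts and edges without drawing).

```python
import numpy as np

# Synthetic stand-in for the real X and Y (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
Y = rng.normal(size=10_000)

h, xedges, yedges = np.histogram2d(X, Y, bins=(160, 160))

threshold = 5            # hypothetical cutoff; pick one that suits your data
dense = h >= threshold   # boolean mask of shape (160, 160)

# Bin index of every point along each axis. Bins are half-open [lo, hi),
# except the last one, which also includes the maximum; clip handles that.
ix = np.clip(np.searchsorted(xedges, X, side="right") - 1, 0, h.shape[0] - 1)
iy = np.clip(np.searchsorted(yedges, Y, side="right") - 1, 0, h.shape[1] - 1)

# Keep only the original points that sit in a dense bin
keep = dense[ix, iy]
X_dense, Y_dense = X[keep], Y[keep]

# Alternatively, represent each dense bin by its centre and its count
xcenters = 0.5 * (xedges[:-1] + xedges[1:])
ycenters = 0.5 * (yedges[:-1] + yedges[1:])
bi, bj = np.nonzero(dense)
dense_bins = np.column_stack([xcenters[bi], ycenters[bj], h[bi, bj]])

print(X_dense.size, "points kept in", len(dense_bins), "dense bins")
```

X_dense and Y_dense keep the raw points, while dense_bins gives one row (x centre, y centre, count) per dense bin if a single value per bin is enough.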
ELinda