Pandas how to bin and groupby without categorical range of values

Question

I have a large number of latitude and longitude values that I would like to bin together in order to display them on a heatmap (ipyleaflet only seems to allow 2000 or so points in the heatmap and this would also be much more efficient when using big data).

I am actually using vaex, but a answer in terms of pandas is also fine.

The pandas pd.cut function seems to be helpful in terms of binning, however it produces a categorical column (category dtype) which looks like a list of all the values in the bin. Is there some way of changing this to just be an incremental number identifying each bin (thanks jezreal for that part of the answer)? I just need a bin number in order to then groupby the bin number and average (mean) on the latitude and longitude columns. I also need a count for the intensity of the heatmap entry.

For example:

dft = pd.DataFrame({
    'latitude': [1.5, 0.5, 1.2, 0.9, 3],
    'longitude': [3, 0.2, 2, 0.2, 1.1]
    })

dft['bin'] = pd.cut(dft['latitude'], bins=3, labels=False).astype(str) + "_" + pd.cut(dft['longitude'], bins=3, labels=False).astype(str)

dft.groupby('bin').agg(['mean', 'count']).unstack()

Almost gives me the answer, but I think I want this output instead:

bin latitude_mean longitude_mean count
0_0 0.7           0.2            2
0_1 1.2           2.0            1
1_2 1.5           3.0            1
2_0 3.0           1.1            1

It would be helpful if the count can be normalized between 1 and 1000.

How can I use pandas pd.cut or something else to groupby the bin in the rows, average on the latitude and longitude and (heatmap intensity) count in the columns?

What means `intensity count` ? Is possible add some small data sample and see expected output? — jezrael, Mar 24 '21 at 08:11
`intensity count` is the count of the number of the values in the bin which would then be used as a heatmap intensity. The value should be normalized between 1 and 1000. — Superdooperhero, Mar 24 '21 at 09:25
Super, thank you. So finally after normalize how looks ouput? — jezrael, Mar 24 '21 at 09:26
I guess the normalization should happen as an additional step, whereby it takes the maximum count and then scale the counts so that 1000 becomes the new maximum and the other values interpolate as integers inbetween. — Superdooperhero, Mar 24 '21 at 09:28

jezrael · Accepted Answer · 2021-03-24T09:39:55.277

The pandas pd.cut function seems to be helpful in terms of binning, however it produces a categorical column (category dtype) which looks like a list of all the values in the bin. Is there some way of changing this to just be an incremental number identifying each bin

Yes, use label=False parameter in cut:

labels array or False, default None
Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins.

Last use GroupBy.agg for aggregation and last normalize count column:

df = dft.groupby('bin').agg(latitude_mean=('latitude','mean'),
                            longitude_mean=('longitude','mean'),
                            count=('latitude','count'))

#https://stackoverflow.com/a/50028155/2901002
a, b = 1, 1000
x, y = df['count'].min(),df['count'].max()
df['count'] = (df['count'] - x) / (y - x) * (b - a) + a

print (df)

     latitude_mean  longitude_mean   count
bin                                       
0_0            0.7             0.2  1000.0
0_1            1.2             2.0     1.0
1_2            1.5             3.0     1.0
2_0            3.0             1.1     1.0

Thanks for the first part of the answer, I have changed my question to hopefully be more clear about my needs and to provide an example. — Superdooperhero, Mar 24 '21 at 09:25

Pandas how to bin and groupby without categorical range of values

1 Answers1