
I have a bunch of geographical data as below. I would like to group the data into bins of 0.2 degrees in longitude AND 0.2 degrees in latitude.

While it is trivial to do for either latitude or longitude, what is the most appropriate way of doing this for both variables?

|User_ID  |Latitude  |Longitude|Datetime           |u    |v    |
|---------|----------|---------|-------------------|-----|-----|
|222583401|41.4020375|2.1478710|2014-07-06 20:49:20|0.3  | 0.2 |
|287280509|41.3671346|2.0793115|2013-01-30 09:25:47|0.2  | 0.7 |
|329757763|41.5453577|2.1175164|2012-09-25 08:40:59|0.5  | 0.8 |
|189757330|41.5844998|2.5621569|2013-10-01 11:55:20|0.4  | 0.4 |
|624921653|41.5931846|2.3030671|2013-07-09 20:12:20|1.2  | 1.4 |
|414673119|41.5550136|2.0965829|2014-02-24 20:15:30|2.3  | 0.6 |
|414673119|41.5550136|2.0975829|2014-02-24 20:16:30|4.3  | 0.7 |
|414673119|41.5550136|2.0985829|2014-02-24 20:17:30|0.6  | 0.9 |

So far, what I have done is create two linear spaces:

import numpy as np
import pandas as pd

lonbins = np.linspace(df.Longitude.min(), df.Longitude.max(), 10)
latbins = np.linspace(df.Latitude.min(), df.Latitude.max(), 10)

Then I can group the data using:

groups = df.groupby(pd.cut(df.Longitude, lonbins))

I could then iterate over the groups to create a second level. But since my goal is to do statistical analysis on each group and possibly display the results on a map, this does not look very handy.

bucket = {}
for name, group in groups:
    print(name)
    bucket[name] = group.groupby(pd.cut(group.Latitude, latbins))
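
Alternatively, I suppose both cuts could be passed to a single groupby, which yields a MultiIndex of bin pairs; a minimal sketch using the latbins/lonbins above:

# One groupby over both binned dimensions: each key is a
# (Latitude interval, Longitude interval) pair.
groups2d = df.groupby([pd.cut(df.Latitude, latbins),
                       pd.cut(df.Longitude, lonbins)])
counts = groups2d.size()  # number of rows per lat/lon box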

For example, I would like to make a heatmap displaying the number of rows per lat/lon box, display the distribution of speed in each of the lat/lon boxes, ...

– tog

3 Answers


How about this?

import numpy as np

step = 0.2
to_bin = lambda x: np.floor(x / step) * step  # snap each coordinate down to the nearest 0.2-degree edge
df["latBin"] = to_bin(df.Latitude)
df["lonBin"] = to_bin(df.Longitude)
groups = df.groupby(["latBin", "lonBin"])
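
From there, the per-box statistics the question asks about are one aggregation away. A minimal sketch (the column names `count`, `u_mean`, `v_mean` and the heatmap layout are choices of mine, not part of the original answer):

# Per-box statistics: row count and mean speed components.
stats = groups.agg(count=("User_ID", "size"),
                   u_mean=("u", "mean"),
                   v_mean=("v", "mean"))

# Pivot the counts into a 2-D grid (latBin rows, lonBin columns),
# ready for a heatmap.
heat = stats["count"].unstack("lonBin", fill_value=0)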
– Martin Valgur
  • thanks, that's a good solution. If I want to more easily convert the statistics back to a layer that I could possibly add to a Basemap (I am not sure this is possible), maybe I should mix your solution with the linspace. For example: `to_lonbin = lambda x: lonbin.searchsorted(x)` What do you think? – tog Sep 01 '16 at 20:29
  • That should work, but searching does not seem necessary here. `to_lonbin = lambda x: np.floor((x + 180) / step)` should produce the same result. – Martin Valgur Sep 01 '16 at 20:39
  • Say that I have to group the data not on a simple regular 0.2° grid, but on a given grid (for which I have `lat_bin_edges` and `lon_bin_edges` values, but it is not exactly regular). How do I modify this? – claude Nov 16 '16 at 20:14
  • You could replace the `to_bin` with a function that maps the coordinate to an arbitrary bin. Then you can do `groupby()` on the bin(s) like in the above example; a sketch of this is given after these comments. – Martin Valgur Nov 16 '16 at 21:22
  • Thanks, though I think that in your reply you should have `round` and not `floor`, correct? `latBin`/`lonBin` in the groupby are the center of your bin, and not the edges, so I think it matters? Am I off? – claude Nov 17 '16 at 16:19
  • Hello @MartinValgur, I'd like to get the bounding box coordinates for each of the locations provided by the solution. I think it's as simple as adding the step value (in this case 0.2) to the latbin and lonbin columns, to create two new dependent columns from the first two. Is that correct? I am having trouble formatting the code, so I put the sample code in an answer below: `df["latbin"] = df.Lat.map(to_bin); df["lonbin"] = df.Long.map(to_bin); df['latbin_bound'] = df['latbin'] + step; df['lonbin_bound'] = df['lonbin'] + step` – Inder Jalli Apr 12 '21 at 04:42
  • The answer looks great. I wonder, if we want to calculate the average value of `u` or `v` in the above data set for the 0.2 × 0.2 bins, how should we proceed? Thanks. – Huanian Zhang Nov 11 '21 at 03:01
  • What's the point of using the map function? Doesn't it defeat the vectorization purpose? – adir abargil Jan 18 '22 at 18:05
  • Yeah, there's no reason for that. It's trivial to convert it into a vectorized function call and it's 25x faster by my measurement. Thanks for pointing that out! – Martin Valgur Jan 19 '22 at 15:59
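
A minimal sketch of the arbitrary-grid variant discussed in the comments above, assuming `lat_bin_edges` and `lon_bin_edges` are sorted 1-D arrays of edge values; `pd.cut` accepts explicit edges directly:

# Bin on explicit, possibly irregular edges instead of a fixed step.
df["latBin"] = pd.cut(df.Latitude, lat_bin_edges)
df["lonBin"] = pd.cut(df.Longitude, lon_bin_edges)
# observed=True keeps only bin combinations that actually occur.
groups = df.groupby(["latBin", "lonBin"], observed=True)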

Here is a significantly faster solution:

Option 1:

Use this if you want the bin edges to be precisely every 0.2 degrees, at values like [66.6, 66.8, 67.0, ...]:

import numpy as np

step = 0.2
# Expand the data's bounding box outward to the nearest multiples of 0.2.
boundingBox = {"lat":
               {"min": np.floor(df.Latitude.min() / step) * step,
                "max": np.ceil(df.Latitude.max() / step) * step},
               "lon":
               {"min": np.floor(df.Longitude.min() / step) * step,
                "max": np.ceil(df.Longitude.max() / step) * step}
               }
# The number of bin edges is the number of 0.2-degree intervals plus one
# (round() guards against floating-point truncation in the division).
noOfLatEdges = int(round(
    (boundingBox["lat"]["max"] - boundingBox["lat"]["min"]) / step)) + 1
noOfLonEdges = int(round(
    (boundingBox["lon"]["max"] - boundingBox["lon"]["min"]) / step)) + 1
latBins = np.linspace(boundingBox["lat"]["min"],
                      boundingBox["lat"]["max"], noOfLatEdges)
lonBins = np.linspace(boundingBox["lon"]["min"],
                      boundingBox["lon"]["max"], noOfLonEdges)
H, _, _ = np.histogram2d(df.Latitude, df.Longitude, bins=[latBins, lonBins])
binnedData = H.T  # Important, as otherwise the axes are the wrong way around (I missed this for ages; see the "Notes" section of the NumPy docs for histogram2d())
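
A minimal plotting sketch, assuming matplotlib (not part of the original answer); pcolormesh checks the array shape against the edge arrays, which makes the orientation explicit:

import matplotlib.pyplot as plt

# pcolormesh expects C with shape (len(Y) - 1, len(X) - 1).
# binnedData = H.T has shape (len(lonBins) - 1, len(latBins) - 1),
# so with the transposed array latitude runs along the horizontal axis.
plt.pcolormesh(latBins, lonBins, binnedData)
plt.xlabel("Latitude")
plt.ylabel("Longitude")
plt.colorbar(label="rows per bin")
plt.show()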

Option 2:

Use this if you want the bin edges to be almost exactly every 0.2 degrees, at values like [66.653112, 66.853112, 67.053112, ...]:

import numpy as np
step = 0.2
boundingBox = {"lat":
               {"min": df.Latitude.min(), "max": df.Latitude.max()},
               "lon":
               {"min": df.Longitude.min(), "max": df.Longitude.max()}
               }
# Passing integers to histogram2d means "number of bins", so the bin
# width here is only approximately 0.2 degrees.
noOfLatEdges = int(
    (boundingBox["lat"]["max"] - boundingBox["lat"]["min"]) / step)
noOfLonEdges = int(
    (boundingBox["lon"]["max"] - boundingBox["lon"]["min"]) / step)
H, xedges, yedges = np.histogram2d(df.Latitude, df.Longitude,
                                   bins=[noOfLatEdges, noOfLonEdges])
binnedData = H.T  # Important, as otherwise the axes are the wrong way around (see the "Notes" section of the NumPy docs for histogram2d())
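
Picking up a question from the comments on the accepted answer: per-bin averages of `u` or `v` fall out of the same histogram approach via the `weights` argument. A minimal sketch (my addition), reusing `binnedData` from Option 2:

# Per-bin sum of u, then divide by the per-bin counts to get the mean.
uSum, _, _ = np.histogram2d(df.Latitude, df.Longitude,
                            bins=[noOfLatEdges, noOfLonEdges],
                            weights=df.u)
with np.errstate(invalid="ignore"):
    uMean = uSum.T / binnedData  # NaN where a bin holds no rows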

Runtime comparisons between Option 1, Option 2, and Martin Valgur's accepted solution:

This answer may seem long to many; however, I recently did a project where this was a time-critical component run extremely frequently, so this approach helped me immensely in cutting down our API computation time (percentage-wise).

Runtimes were calculated on a 16k-row DataFrame split into 964 × 1381 buckets.

The currently accepted answer by Martin Valgur has a runtime of:

30.1 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit: Probably closer to 1.5 ms when the bin calculation is done in a vectorized manner.

Option 1 has a runtime of:

6.78 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Option 2 has a runtime of:

8.17 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Therefore, Option 1 is almost 5x faster than the currently accepted solution. However, if the number of buckets becomes too large for memory, Martin's pandas solution will likely be more performant.

– shadowbourne
  • Did you guys try Martin Valgur's solution without map? – adir abargil Jan 18 '22 at 18:03
  • My answer was written before someone improved Martin's solution using more vectorisation, so now his should be faster :) But this answer still provides more options in terms of how one wants the bounds set out. – shadowbourne Jan 23 '22 at 21:39

An even shorter variant of the accepted answer is:

step = 0.2
# Floor division snaps each coordinate down to its bin's lower edge.
df["latBin"] = (df.Latitude // step) * step
df["lonBin"] = (df.Longitude // step) * step
groups = df.groupby(["latBin", "lonBin"])
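
Note that floor division (`//`) rounds toward negative infinity, just like `np.floor`, so this matches the accepted answer for negative coordinates (southern latitudes and western longitudes) as well; a quick check (the values here are illustrative):

import pandas as pd

s = pd.Series([-41.47, 2.15])
print((s // 0.2) * 0.2)  # -41.6 and 2.0: both snapped down to the lower bin edge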
– asmaier