
I have a DataFrame containing objects and their coordinates:

      id        lat         lng
0   3816  18.384001  -66.114799
1   5922  20.766100 -156.434998
2   1527  21.291394 -157.843085
3   1419  21.291394 -157.843085
4   1651  21.291394 -157.843085

Multiple objects can have the same coordinates. The DataFrame is big (millions of records). I have a target point with coordinates (target_lat, target_lng). My goal is to find the objects in the DataFrame that are within X miles of the target point, as efficiently as possible.

I'm using a haversine_np function adapted from this question. It takes parameters (lat_series, lng_series, lat, lng) and efficiently computes all distances between (lat_series, lng_series) (two Series) and (lat, lng) (two numbers).
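The function itself isn't shown here, but a minimal vectorized sketch (assuming the (lat_series, lng_series, lat, lng) argument order described above and a mean Earth radius of 3959 miles) could look like this:

```python
import numpy as np

def haversine_np(lat_series, lng_series, lat, lng):
    """Vectorized great-circle distance in miles between each point in
    (lat_series, lng_series) and the single point (lat, lng)."""
    lat1, lng1, lat2, lng2 = map(np.radians, [lat_series, lng_series, lat, lng])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))  # 3959 mi ~ mean Earth radius
```

NumPy broadcasting makes this work for a whole Series against a single point in one pass, which is what makes the approach viable for millions of rows.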

Now my question is how to use it to filter the distances and select objects in the original dataframe.

This is my current solution:

grouper = df.groupby(['lat', 'lng'], sort=False).grouper
lat_series = grouper.result_index.get_level_values(0)  # lats of unique (lat, lng) pairs
lng_series = grouper.result_index.get_level_values(1)  # lngs of unique (lat, lng) pairs
df['location_index'] = grouper.group_info[0]  # assign index of group back to df
distances = haversine_np(lat_series, lng_series, target_lat, target_lng)
mask = distances <= 50  # let's say 50 miles; boolean mask of size = ngroups
loc_indexes = pd.Series(range(grouper.ngroups))[mask]  # select group indexes by mask
df[df.location_index.isin(loc_indexes)]  # select original records by group indexes

It seems to work, but it doesn't look reliable: when I select the relevant group indexes with pd.Series(range(grouper.ngroups))[mask], I assume that the group labels are naturally indexed (from 0 to ngroups-1). In other words, I'm relying on the i-th element of grouper.result_index.get_level_values() corresponding to the group labelled i in grouper.group_info[0]. I couldn't find a more explicit way to get that mapping.
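For comparison, one way to get the same "compute once per unique coordinate pair" behavior without touching grouper internals is to deduplicate the coordinates and merge back. A sketch, using a hypothetical small frame and an assumed haversine_np implementation:

```python
import numpy as np
import pandas as pd

def haversine_np(lat_series, lng_series, lat, lng):
    # assumed vectorized haversine implementation, returning miles
    lat1, lng1, lat2, lng2 = map(np.radians, [lat_series, lng_series, lat, lng])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# toy stand-in for the big frame from the question
df = pd.DataFrame({'id':  [3816, 5922, 1527, 1419, 1651],
                   'lat': [18.384001, 20.766100, 21.291394, 21.291394, 21.291394],
                   'lng': [-66.114799, -156.434998, -157.843085, -157.843085, -157.843085]})

target_lat, target_lng = 21.29, -157.84

# distances are computed once per unique (lat, lng) pair, then merged back
uniq = df[['lat', 'lng']].drop_duplicates()
mask = haversine_np(uniq.lat, uniq.lng, target_lat, target_lng) <= 50
result = df.merge(uniq[mask], on=['lat', 'lng'])
```

The merge keys are the coordinate values themselves, so no assumption about group-label ordering is needed.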

Questions:

  1. Is the method I'm using reliable?
  2. Is there a better (safer / more concise / more efficient) method?
Dennis Golomazov

2 Answers


UPDATE: @DennisGolomazov found out that this "prefiltering" doesn't work properly for longitudes and gave a very good example - here is a small demo:

In [115]: df
Out[115]:
     id   lat    lng
5  4444  40.0 -121.0
0  1111  40.0 -120.0

In [116]: %paste
threshold = 60
max_lng_factor = 69.17
max_lat_factor = 69.41
target_lat, target_lng = 40, -120
mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
       & \
       df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)
x = df.loc[mask, ['lat','lng']].drop_duplicates()
## -- End pasted text --

In [117]: x
Out[117]:
    lat    lng
0  40.0 -120.0

where the distance between these two coordinates is less than our threshold (60 miles):

In [119]: haversine_np(-120, 40, -121, 40)
Out[119]: 52.895043596886239

Conclusion: we can prefilter latitudes, but not longitudes:

In [131]: df
Out[131]:
     id   lat    lng
5  4444  40.0 -121.0
0  1111  40.0 -120.0
1  2222  42.0 -121.0

Correct prefiltering:

In [132]: mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor)
     ...: x = df.loc[mask, ['lat','lng']].drop_duplicates()
     ...:

In [133]: x
Out[133]:
    lat    lng
5  40.0 -121.0
0  40.0 -120.0

Check:

In [135]: df.reset_index() \
     ...:   .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
     ...:           .query("distance <= @threshold"),
     ...:          on=['lat','lng'])
     ...:
Out[135]:
   index    id   lat    lng   distance
0      5  4444  40.0 -121.0  52.895044
1      0  1111  40.0 -120.0   0.000000

Old, partially incorrect answer:

I would try to do prefiltering in order to optimize the calculations. For example, you can easily filter out the points that are definitely outside of your "rectangle of interest".

Demo:

threshold = 100

# http://gis.stackexchange.com/questions/142326/calculating-longitude-length-in-miles/142327#142327
max_lng_factor = 69.17
max_lat_factor = 69.41

target_lat, target_lng = 21.29, -157.84

mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
       & \
       df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)

x = df.loc[mask, ['lat','lng']].drop_duplicates()

df.reset_index() \
  .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
          .query("distance <= @threshold"),
         on=['lat','lng']) \
  .drop('distance', axis=1) \
  .set_index('index')

Result:

In [142]: df.reset_index() \
     ...:   .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
     ...:           .query("distance <= @threshold"),
     ...:          on=['lat','lng']) \
     ...:   .drop('distance', axis=1) \
     ...:   .set_index('index')
     ...:
Out[142]:
         id        lat         lng
index
1      5922  20.766100 -156.434998
2      1527  21.291394 -157.843085
3      1419  21.291394 -157.843085
4      1651  21.291394 -157.843085
MaxU - stand with Ukraine
  • One problem with this is that 1 degree of longitude is equal to 69.17 miles only at equator, and closer to poles it's less, which will incorrectly exclude points which are close enough by distance, but far enough by longitude. – Dennis Golomazov Apr 08 '17 at 19:36
  • @DennisGolomazov, that is why i called them __max__ factor. I'm filtering only those points - outside of __maximum__ rectangle – MaxU - stand with Ukraine Apr 08 '17 at 22:07
    Correct me if I'm wrong, but I think it's incorrect. The max factor is in denominator (`threshold/max_lng_factor`), so the rectangle with largest factor is actually the smallest one. But closer to the poles the rectangle should be larger, so the factor should be smaller (down to zero at the poles). Hope my explanation makes sense. – Dennis Golomazov Apr 08 '17 at 23:25
    @DennisGolomazov, i think you are right. So we can prefilter only `latitude`, but not the `longitude` – MaxU - stand with Ukraine Apr 08 '17 at 23:32
  • Yes, agree, for latitude it should work, and it's helpful. Thanks! – Dennis Golomazov Apr 08 '17 at 23:35
    This is an example. Let `target_lat, target_lng = (40, -120)`. Consider `point = (40, -121)` and `threshold = 60`. Then this point will not be selected by your function, because `threshold/max_lng_factor = 0.867` and `df.lng.sub(target_lng).abs() = 121-120 = 1`. But in reality the distance between these points is `85km = 52.8 miles < 60`, so the point should be selected. – Dennis Golomazov Apr 09 '17 at 01:29
    @DennisGolomazov, yeah, very good example - thank you! You are absolutely right! I'm going to correct the answer... – MaxU - stand with Ukraine Apr 09 '17 at 12:07

Maybe I'm missing something on the efficiency, but I don't understand why you are using the .grouper attribute. To get the lat and lng series, just reference them directly, i.e. df['lat'] or df.lat; then you can compute the distances with

distances = haversine_np(df.lat, df.lng, target_lat, target_lng)

and create a mask with

mask = distances <= 50

The mask is now indexed to the dataframe.

df[mask]

will provide only the True elements.
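Putting those three snippets together, a self-contained sketch of this direct approach (with an assumed haversine_np implementation and a toy frame) might look like:

```python
import numpy as np
import pandas as pd

def haversine_np(lat_series, lng_series, lat, lng):
    # assumed vectorized haversine implementation, returning miles
    lat1, lng1, lat2, lng2 = map(np.radians, [lat_series, lng_series, lat, lng])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# hypothetical small frame: one far point, one near the target
df = pd.DataFrame({'id':  [3816, 1527],
                   'lat': [18.384001, 21.291394],
                   'lng': [-66.114799, -157.843085]})

target_lat, target_lng = 21.29, -157.84

distances = haversine_np(df.lat, df.lng, target_lat, target_lng)
mask = distances <= 50   # boolean Series aligned to df's index
within = df[mask]        # rows within 50 miles of the target
```

Because the mask is built from df's own columns, it shares df's index and can be used for boolean selection without any merging.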

Jim Parker
  • The problem here is that there are many duplicate values in `lat/lng` columns, and I want to calculate `haversine_np` only for the unique values. Otherwise, yes, your method would be the simplest one. – Dennis Golomazov Apr 08 '17 at 17:49