How to group and aggregate data using pandas/Python only if a specific condition/calculation is met?

Question

There is a pandas.DataFrame df that looks like this:

City     Country   Latitude    Longitude      Population   ...

Berlin   Germany   52.516602   13.304105      118704
Berlin   Germany   52.430884   13.192662      292000
...
Berlin   USA       39.7742446  -75.0013423    7588
Berlin   USA       43.9727912  -88.9858084    5524

I would like to group data by columns City and Country and sum up their population:

grouped_data = df.groupby(['City', 'Country'])['Population'].agg('sum').reset_index()

But to handle ambiguity (the two entries for USA must not be merged), my idea was to calculate and check the distance between the lat/long coordinates for every potential groupby() result.

Assuming I have a function distance() that returns the distance between two geographic points in kilometres, I'd like to group all entries by City and Country and sum up their population only if the result of distance() is below some threshold, e.g. less than 50 kilometres.

The output for the example above could look like:

City    Country  Latitude                Longitude              Population

Berlin  Germany  [52.516602, 52.430884]  [13.304105, 13.192662] 410704
...
Berlin  USA      39.7742446              -75.0013423            7588
Berlin  USA      43.9727912              -88.9858084            5524

Any idea how to solve this in pandas? I'd be grateful for any suggestions.

What would you want to happen if the distance is greater than 50 km? Ignore all the entries in the group? Break up the group? — busybear, Dec 17 '20 at 16:26
Also, are there always only two entries per group? Calculating distances would get a bit more complicated if there are more than two entries. — busybear, Dec 17 '20 at 16:26
@busybear Thanks for your questions! If the distance is greater than 50 km, I would like to ignore only that entry (that is, the second one). The first appearance of a city and country should be the one to match against. And yes, there could be more than two entries per group. — jengeb, Dec 17 '20 at 16:45

score 1 · Answer 1 · answered Dec 17 '20 at 16:26

What you are asking for is really a graph problem in which two nodes are connected if the distance between them is less than 50 km. To solve it, you can compute a distance matrix and build the graph with networkx. Something along these lines:

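import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import haversine_distances as haversine

# pairwise haversine distances in km; the input must be [latitude, longitude] in radians
dist_mat = haversine(np.deg2rad(df[['Latitude', 'Longitude']].to_numpy())) * 6371  # Earth's mean radius in km

# connect two rows if they are less than 50 km apart
adjacency = dist_mat < 50

# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it
G = nx.from_numpy_array(adjacency)
components = nx.connected_components(G)  # generator of sets of row positions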

You can then group by those components.
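The grouping step is not spelled out above, so here is a minimal sketch of one way it might look, not a definitive implementation. It makes several assumptions: the sample rows are the four shown in the question, the distance matrix is recomputed with plain NumPy (so the sketch needs only pandas, NumPy and networkx), edges are additionally restricted to rows sharing the same City and Country, and the helper `haversine_km` and the column name `component` are made up for illustration.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "City": ["Berlin", "Berlin", "Berlin", "Berlin"],
    "Country": ["Germany", "Germany", "USA", "USA"],
    "Latitude": [52.516602, 52.430884, 39.7742446, 43.9727912],
    "Longitude": [13.304105, 13.192662, -75.0013423, -88.9858084],
    "Population": [118704, 292000, 7588, 5524],
})

def haversine_km(lat_deg, lon_deg, radius_km=6371.0):
    """Hypothetical helper: pairwise great-circle distances in km."""
    lat, lon = np.deg2rad(lat_deg), np.deg2rad(lon_deg)
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

dist_km = haversine_km(df["Latitude"].to_numpy(), df["Longitude"].to_numpy())

# Assumption: only rows with the same (City, Country) may be connected.
key = df.groupby(["City", "Country"]).ngroup().to_numpy()
same_key = key[:, None] == key[None, :]
adjacency = (dist_km < 50) & same_key

# Nodes are row positions 0..n-1; each connected component becomes one group.
G = nx.from_numpy_array(adjacency.astype(int))
component = np.empty(len(df), dtype=int)
for label, nodes in enumerate(nx.connected_components(G)):
    component[list(nodes)] = label

result = (
    df.assign(component=component)
      .groupby(["City", "Country", "component"], as_index=False)
      .agg(Latitude=("Latitude", list),
           Longitude=("Longitude", list),
           Population=("Population", "sum"))
)
print(result)
```

Note that connected components are transitive: A and C end up in the same group if both are within 50 km of some B, even when A and C are further apart than that. This is not the same as the rule from the comments, where only the first appearance of a city is matched against.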

On the other hand, it might be easier to bin the latitude/longitude values and group by those bins.
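A rough sketch of the binning idea, under stated assumptions: the sample rows are the four from the question, and the 1-degree bin width and the column names `lat_bin` and `lon_bin` are hypothetical choices for illustration, not something the answer specifies.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "City": ["Berlin", "Berlin", "Berlin", "Berlin"],
    "Country": ["Germany", "Germany", "USA", "USA"],
    "Latitude": [52.516602, 52.430884, 39.7742446, 43.9727912],
    "Longitude": [13.304105, 13.192662, -75.0013423, -88.9858084],
    "Population": [118704, 292000, 7588, 5524],
})

bin_deg = 1.0  # hypothetical bin width in degrees

# Assign each row to a grid cell by flooring its coordinates to the bin width.
binned = df.assign(
    lat_bin=np.floor(df["Latitude"] / bin_deg).astype(int),
    lon_bin=np.floor(df["Longitude"] / bin_deg).astype(int),
)

# Rows are merged only if they share City, Country and grid cell.
result = binned.groupby(
    ["City", "Country", "lat_bin", "lon_bin"], as_index=False
)["Population"].sum()
print(result)
```

Binning is only an approximation of a distance threshold. One degree of latitude is roughly 111 km, while the width of a degree of longitude shrinks towards the poles, and two nearby points that straddle a bin edge land in different groups.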
