Pandas dataframe: Applying a function to a row value and the value from the previous row

Question

I am trying to apply the following function to a Pandas dataframe:

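import numpy as np
from geopy import distance

def eukarney(lat1, lon1, alt1, lat2, lon2, alt2):
    p1 = (lat1, lon1)
    p2 = (lat2, lon2)
    # Geodesic distance in metres, combined with the altitude difference
    karney = distance.distance(p1, p2).m
    return np.sqrt(karney**2 + (alt2 - alt1)**2)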

This works when I pass scalar values, for instance:

dist = eukarney(49.907611, 5.890404, 339.15734, 49.907683, 5.890373, 339.18224)

However, it fails when I apply the function to the dataframe columns, pairing each row with the previous one via shift():

df['distances'] = eukarney(df['latitude'], df['longitude'], df['altitude'], df['latitude'].shift(), df['longitude'].shift(), df['altitude'].shift())

I receive the following error message:

Traceback (most recent call last):
  File "/home/mirix/Desktop/plage/GPX_invert_sense_change_starting_point_va.py", line 78, in <module>
    df['distances'] = eukarney(df.loc[:,'latitude':], df.loc[:,'longitude':], df.loc[:,'altitude':], df.loc[:,'latitude':].shift(), df.loc[:,'longitude':].shift(), df.loc[:,'altitude':].shift())
  File "/home/mirix/Desktop/plage/GPX_invert_sense_change_starting_point_va.py", line 75, in eukarney
    karney = distance.distance(p1, p2).m
  File "/home/mirix/.local/lib/python3.9/site-packages/geopy/distance.py", line 522, in __init__
    super().__init__(*args, **kwargs)
  File "/home/mirix/.local/lib/python3.9/site-packages/geopy/distance.py", line 276, in __init__
    kilometers += self.measure(a, b)
  File "/home/mirix/.local/lib/python3.9/site-packages/geopy/distance.py", line 538, in measure
    a, b = Point(a), Point(b)
  File "/home/mirix/.local/lib/python3.9/site-packages/geopy/point.py", line 175, in __new__
    return cls.from_sequence(seq)
  File "/home/mirix/.local/lib/python3.9/site-packages/geopy/point.py", line 472, in from_sequence
    return cls(*args)
  File "/home/mirix/.local/lib/python3.9/site-packages/geopy/point.py", line 188, in __new__
    _normalize_coordinates(latitude, longitude, altitude)
  File "/home/mirix/.local/lib/python3.9/site-packages/geopy/point.py", line 57, in _normalize_coordinates
    latitude = float(latitude or 0.0)
  File "/home/mirix/.local/lib/python3.9/site-packages/pandas/core/generic.py", line 1534, in __nonzero__
    raise ValueError(
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Intriguingly, the same syntax works for other functions not using the geopy library.

Any ideas?

SOLUTION

GeoPy's distance function appears to accept only scalar coordinates, not Pandas Series or dataframes.

The following workaround is based on @SeaBean's answer below:
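The last geopy frame in the traceback, `latitude = float(latitude or 0.0)`, suggests why only scalars work: `or` has to evaluate the truth value of its left operand, and Pandas refuses to do that for a whole Series or dataframe. A minimal sketch that reproduces the same kind of error without geopy (`to_float` is a hypothetical stand-in that mirrors that one line, not geopy's actual code):

```python
import pandas as pd

def to_float(value):
    # Hypothetical stand-in mirroring the line shown in the traceback
    return float(value or 0.0)

lat = pd.Series([49.907611, 49.907683])

# A single element is a scalar, so `or` and float() both work
scalar_result = to_float(lat.iloc[0])

# A whole Series has no single truth value, so `or` raises ValueError
try:
    to_float(lat)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(scalar_result)
print(error_message)
```

Functions built only from NumPy arithmetic never ask for a truth value, which would explain why the same calling syntax works for them.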

df['lat_shift'] = df['latitude'].shift().fillna(df['latitude'])
df['lon_shift'] = df['longitude'].shift().fillna(df['longitude'])
df['alt_shift'] = df['altitude'].shift().fillna(df['altitude'])

df['distances'] = df.apply(lambda x: eukarney(x['latitude'], x['longitude'], x['altitude'], x['lat_shift'], x['lon_shift'], x['alt_shift']), axis=1).fillna(0)

What is this `distance.distance`? Does it accept `np.array` or just scalars/floats? – Quang Hoang, Oct 12 '21 at 19:33
`distance.distance` comes from `from geopy import distance`. https://geopy.readthedocs.io/en/stable/#module-geopy.distance – mirix, Oct 12 '21 at 19:37
Sorry, I overlooked that you need to use the `shift()` values, so using `.apply()` row-wise does not work directly. – SeaBean, Oct 12 '21 at 19:43
@SeaBean Your solution works if I add the data as new columns. I was trying to avoid that, but it is the only workaround I was able to find. – mirix, Oct 12 '21 at 20:40
I modified my answer to include the workaround. You can take it as a reference in case there is no better solution. – SeaBean, Oct 12 '21 at 21:01

SeaBean · Accepted Answer · 2021-10-14T14:15:15.800

You can use .apply() with axis=1 to call the function on each row.

Here, .apply() passes scalar values to the custom function row by row, so you can reuse a function that was designed to work on scalars. Otherwise, you would need to modify the custom function to support vectorized Pandas values.
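For reference, here is a rough sketch of what a vectorized version could look like. Note that it swaps geopy's Karney geodesic for the haversine great-circle formula, which treats the Earth as a sphere and is therefore only an approximation. The function name `haversine_3d_to_previous` and the mean radius constant are my own assumptions, not part of geopy or of the question.

```python
import numpy as np
import pandas as pd

# Assumption: mean Earth radius in metres for the spherical (haversine) model
EARTH_RADIUS_M = 6_371_008.8

def haversine_3d_to_previous(df):
    """Approximate 3D distance in metres from each row to the previous row.

    The first row has no predecessor, so its distance is 0.
    """
    lat = np.radians(df["latitude"])
    lon = np.radians(df["longitude"])
    # Previous row's coordinates; the first row is paired with itself
    lat_prev = lat.shift().fillna(lat)
    lon_prev = lon.shift().fillna(lon)
    alt_diff = df["altitude"].diff().fillna(0.0)

    # Haversine formula, evaluated on whole columns at once
    a = (np.sin((lat - lat_prev) / 2) ** 2
         + np.cos(lat) * np.cos(lat_prev) * np.sin((lon - lon_prev) / 2) ** 2)
    horizontal = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
    return np.sqrt(horizontal ** 2 + alt_diff ** 2)

# The two points from the question
df = pd.DataFrame({
    "latitude": [49.907611, 49.907683],
    "longitude": [5.890404, 5.890373],
    "altitude": [339.15734, 339.18224],
})
distances = haversine_3d_to_previous(df)
print(distances)
```

This avoids the row-wise Python loop entirely, at the cost of the small accuracy difference between the spherical and ellipsoidal models. If you need Karney's accuracy, stay with the row-wise approach.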

To handle the .shift() values, one workaround is to define new columns for them first, so that they can be passed to the custom function inside .apply():

# Take the previous entry with shift() and fill the first row with its own value
# (in case the custom function cannot handle the NaN that shift() leaves in the first row)
df['lat_shift'] = df['latitude'].shift().fillna(df['latitude'])
df['lon_shift'] = df['longitude'].shift().fillna(df['longitude'])
df['alt_shift'] = df['altitude'].shift().fillna(df['altitude'])

df['distances'] = df.apply(lambda x: eukarney(x['latitude'], x['longitude'], x['altitude'], x['lat_shift'], x['lon_shift'], x['alt_shift']), axis=1).fillna(0)
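If you would rather not add helper columns to the dataframe, one possible variation is to zip the original and shifted Series together and call the scalar function in a list comprehension. The sketch below is only illustrative: `apply_with_previous` is a hypothetical helper, and `scalar_step` is a stand-in for `eukarney` (plain Euclidean distance) so that the example runs without geopy.

```python
import math
import pandas as pd

def apply_with_previous(df, columns, func):
    """Call func(*current_values, *previous_values) once per row, with scalars."""
    current = [df[c] for c in columns]
    # shift() moves values down one row; the first row is paired with itself
    previous = [df[c].shift().fillna(df[c]) for c in columns]
    return pd.Series(
        [func(*values) for values in zip(*current, *previous)],
        index=df.index,
    )

def scalar_step(x1, y1, z1, x2, y2, z2):
    # Stand-in for eukarney: like geopy, float() here only accepts scalars
    return math.sqrt(float(x2 - x1) ** 2 + float(y2 - y1) ** 2 + float(z2 - z1) ** 2)

points = pd.DataFrame({"x": [0.0, 3.0], "y": [0.0, 4.0], "z": [0.0, 0.0]})
steps = apply_with_previous(points, ["x", "y", "z"], scalar_step)
print(steps.tolist())
```

With the question's data, the call would presumably be `apply_with_previous(df, ['latitude', 'longitude', 'altitude'], eukarney)`; the shifted Series are temporary and never stored in `df`.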

edited Oct 14 '21 at 14:15

answered Oct 12 '21 at 19:36

Thank you. Indeed, the problem seems to be intrinsic to the geopy distance function. The apply method works if the shifted columns are created beforehand. – mirix Oct 14 '21 at 09:39
Thanks again @SeaBean. I have tested your workaround, but it needs a couple of minor modifications in order to work. Please have a look at the code in the edited question. – mirix Oct 14 '21 at 09:58
@mirix Good that you fine-tuned the code. That's right, we need to fillna the NaN values after shift. I forgot this because I had not tested it. Glad the problem is finally solved. – SeaBean Oct 14 '21 at 10:37
