
I have already read answers and blog posts about how to iterate over a pandas.DataFrame efficiently (https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6), but I still have one question left.

Currently, my DataFrame represents a GPS trajectory with the columns time, longitude and latitude. Now I want to calculate a feature called distance-to-next-point. So I not only have to iterate through the rows and operate on single rows, but also have to access the subsequent row within a single iteration.

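for i, (index, row) in enumerate(df.iterrows()):
    if i < len(df) - 1:
        # look up the following row by position, so any index works
        next_point = [df['latitude'].iloc[i + 1], df['longitude'].iloc[i + 1]]
        distance = calculate_distance([row['latitude'], row['longitude']], next_point)
        # write to df itself; assigning to `row` only changes a copy
        df.loc[index, 'distance'] = distance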

Besides this problem, I have the same issue when calculating speed, applying smoothing or other similar methods.

Another example: I want to search for data points with speed == 0 m/s and, starting from these points, add all subsequent data points to an array until the speed reaches 10 m/s (to find segments that accelerate from 0 m/s to 10 m/s).

Do you have any suggestions on how to code things like this as efficiently as possible?

  • Using df and df.shift() – BENY Nov 26 '18 at 15:19
  • You may want to investigate using shift (https://stackoverflow.com/questions/22081878/get-previous-rows-value-and-calculate-new-column-pandas-python) and avoid iterating – Hotpepper Nov 26 '18 at 15:28

1 Answer


You can use pd.DataFrame.shift to add shifted series to your dataframe, then feed into your function via apply:

def calculate_distance(row):
    # your function goes here, trivial function used for demonstration
    return sum(row[i] for i in df.columns)

df[['next_latitude', 'next_longitude']] = df[['latitude', 'longitude']].shift(-1)
df.loc[df.index[:-1], 'distance'] = df.iloc[:-1].apply(calculate_distance, axis=1)

print(df)

   latitude  longitude  next_latitude  next_longitude  distance
0         1          5            2.0             6.0      14.0
1         2          6            3.0             7.0      18.0
2         3          7            4.0             8.0      22.0
3         4          8            NaN             NaN       NaN

This works for an arbitrary function calculate_distance, but chances are your algorithm is vectorisable, in which case you should use column-wise Pandas / NumPy methods instead of a row-wise apply.
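
As a rough illustration of the vectorised route, here is a minimal sketch that assumes the distance wanted is the great-circle (haversine) distance on a spherical Earth. The question does not say how calculate_distance is defined, so the helper name haversine_to_next, the radius constant and the sample coordinates are all assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Assumption: spherical Earth with mean radius in metres.
EARTH_RADIUS_M = 6_371_000

def haversine_to_next(df):
    """Return the haversine distance in metres from each row to the next row.

    Hypothetical helper: expects 'latitude' and 'longitude' columns in degrees.
    The last row has no successor, so its distance is NaN.
    """
    lat1 = np.radians(df["latitude"])
    lon1 = np.radians(df["longitude"])
    # shift(-1) aligns every row with the coordinates of the following row
    lat2 = lat1.shift(-1)
    lon2 = lon1.shift(-1)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Made-up sample trajectory, for demonstration only
df = pd.DataFrame({"latitude": [0.0, 0.0, 1.0],
                   "longitude": [0.0, 1.0, 1.0]})
df["distance"] = haversine_to_next(df)
print(df)
```

No Python-level loop or apply is involved here: every operation runs on whole columns, which is usually where the speed-up comes from. The same shift pattern should carry over to speed, for example by dividing the distance by the time difference to the next row.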

jpp