I want to compute a cumulative sum (cumsum) over a column, but reset the sum every time it reaches some threshold value.

I have read several questions regarding conditional reset on cumsum. They all involve some kind of other column that has the "reset value".
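To make the reset behaviour concrete, here is a minimal sketch on a plain list of numbers, with no helper column. The function name `cumsum_with_reset`, the sample values and the `limit` are made up for illustration. The rule assumed here is that the sum starts over on the element after the one where it exceeds the limit.

```python
from itertools import accumulate


def cumsum_with_reset(values, limit):
    """Running sum that starts over after it has exceeded `limit`."""
    # `acc` is the previous running sum; once it has passed the limit,
    # the next element starts a fresh sum instead of being added to it.
    return list(accumulate(values, lambda acc, x: x if acc > limit else acc + x))


sums = cumsum_with_reset([1, 2, 3, 4, 5], limit=5)
print(sums)  # [1, 3, 6, 4, 9]
```

Each value depends on whether the previous running sum already crossed the limit, which is why a plain `cumsum` does not express this directly.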

I am using geopy's distance function to compute the distance (in metres) from the first point (row 0) to every other point, which produces:

    lat         lng         all_distances
0   39.984198   116.319322  0.000000
12  39.984611   116.319822  62.663690
24  39.984252   116.320826  128.601760
36  39.983916   116.320980  145.036185
48  39.982688   116.321225  233.518640
60  39.981441   116.321305  349.856365
72  39.980291   116.321430  469.693983

What I actually want is to accumulate the distance until it exceeds 200 m, and then start over, with the next point taking the place of the "first" point.
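The intended behaviour can be sketched in plain Python. This is only an illustration under a few assumptions: it uses a hand-written haversine formula instead of geopy's `distance` (so the values differ slightly from the geodesic ones), it treats the point where the threshold is exceeded as the new "first" point, and the names `haversine_m` and `reset_points` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (assumption of this sketch)


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def reset_points(points, limit=200.0):
    """Indices at which the distance from the current "first" point exceeds `limit`."""
    resets = []
    anchor = 0  # index of the current "first" point
    for i in range(1, len(points)):
        if haversine_m(points[anchor], points[i]) > limit:
            resets.append(i)
            anchor = i  # start measuring from this point
    return resets


points = [(39.984198, 116.319322), (39.984611, 116.319822),
          (39.984252, 116.320826), (39.983916, 116.320980),
          (39.982688, 116.321225), (39.981441, 116.321305),
          (39.980291, 116.321430), (39.979675, 116.321805),
          (39.979546, 116.322926), (39.979758, 116.324513)]
resets = reset_points(points, limit=200.0)
print(resets)
```

Because each new "first" point depends on where the previous threshold was crossed, every step depends on the one before it.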

Here is a runnable MCVE so that its running time can be compared with that of vectorized solutions.

import pandas as pd
from geopy.distance import distance
print(pd.__version__)

data = [[ 39.984198, 116.319322],
       [ 39.984611, 116.319822],
       [ 39.984252, 116.320826],
       [ 39.983916, 116.32098 ],
       [ 39.982688, 116.321225],
       [ 39.981441, 116.321305],
       [ 39.980291, 116.32143 ],
       [ 39.979675, 116.321805],
       [ 39.979546, 116.322926],
       [ 39.979758, 116.324513]]

user_gps_log = pd.DataFrame(data, columns=['lat', 'lng'])

first_lat = user_gps_log.iloc[0].lat
first_lng = user_gps_log.iloc[0].lng
all_distances = user_gps_log.apply(lambda x: distance((x.lat, x.lng), (first_lat, first_lng)).m, axis=1)

user_gps_log['all_distances'] = all_distances

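p = user_gps_log
i = 0
dist_thres = 200  # metres; matches the 200 m threshold described above

while i < len(p):
    j = i + 1
    while j < len(p):
        dist = distance((p.iloc[i].lat, p.iloc[i].lng), (p.iloc[j].lat, p.iloc[j].lng)).m
        if dist > dist_thres:
            # do stuff
            i = j  # point j becomes the new "first" point
            token = 1
            break
        j = j + 1
    else:
        # no remaining point is farther than dist_thres from point i
        break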

EDIT UPDATE

I tried implementing this with Numba's njit (it seems the iteration cannot be avoided):

from numba import njit

@njit
def cumsum_distance(lat, lng, limit=200):
    running_distance = 0
    first = (lat[0], lng[0])
    for i in range(lat.shape[0]):
        dist = distance(first, (lat[i], lng[i])).m
        running_distance += dist
        if running_distance > limit:
            yield i, running_distance
            running_distance = 0

running_distances = cumsum_distance(user_gps_log.lat.values, user_gps_log.lng.values, 200)

Getting this error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'distance': cannot determine Numba type of <class 'type'>

File "<ipython-input-194-7214618c7e64>", line 6:
def cumsum_distance(lat, lng, limit=200):
    <source elided>
    for i in range(lat.shape[0]):
        dist = distance(first, (lat[i], lng[i])).m
        ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

Is it because I am using geopy's distance function? Do I need to register a "type" for it, the way one does when using a UDAF in PySpark?
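As far as I understand, nopython mode only compiles a subset of Python and NumPy, so a call into a third-party class such as geopy's `distance` cannot be typed. One possible workaround, sketched below, is to write the distance with `math` functions that Numba supports. The assumptions here: a haversine formula stands in for geopy's geodesic distance, the generator is replaced by a function that returns an array of reset indices, and the names `haversine_m` and `reset_indices` are hypothetical. The `try`/`except` only lets the sketch run without Numba installed.

```python
import math

import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) when Numba is missing.
    def njit(func):
        return func

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (assumption of this sketch)


@njit
def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres; inputs are in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@njit
def reset_indices(lat, lng, limit):
    """Indices where the distance from the current "first" point exceeds `limit`."""
    out = np.empty(lat.shape[0], dtype=np.int64)
    n_out = 0
    anchor = 0  # index of the current "first" point
    for i in range(1, lat.shape[0]):
        if haversine_m(lat[anchor], lng[anchor], lat[i], lng[i]) > limit:
            out[n_out] = i
            n_out += 1
            anchor = i
    return out[:n_out]


lat = np.array([39.984198, 39.984611, 39.984252, 39.983916, 39.982688,
                39.981441, 39.980291, 39.979675, 39.979546, 39.979758])
lng = np.array([116.319322, 116.319822, 116.320826, 116.320980, 116.321225,
                116.321305, 116.321430, 116.321805, 116.322926, 116.324513])
breaks = reset_indices(lat, lng, 200.0)
print(breaks)
```

With a DataFrame, the arrays would come from `user_gps_log.lat.values` and `user_gps_log.lng.values`, as in the attempt above.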

  • Here's a relevant post: https://stackoverflow.com/questions/54208023/can-i-perform-dynamic-cumsum-of-rows-in-pandas – ALollz Apr 27 '19 at 17:20
  • Thank you, but are we sure the loop cannot be avoided? Also, off topic: what's the deal with "pep8ing" the question? Do people get more reputation if I approve an edit that corrected my English but added nothing of importance to my question? – koren maliniak Apr 27 '19 at 17:56
  • It seems pretty likely that looping is necessary. As for the edits, users below 2000 rep do get a small amount for providing edits. Grammatical/formatting improvements are substantive and should be approved. It's not an attack on you, simply a user trying to improve the site so that future users have an easier time solving problems. – ALollz Apr 27 '19 at 18:09
  • edit, new problem :) – koren maliniak Apr 27 '19 at 18:42
