I want to compute a cumulative sum (cumsum) over a column, but reset the sum every time it reaches a given value.
I have read several questions about conditionally resetting a cumsum, but they all rely on a separate column that holds the "reset value".
I am using geopy's distance function to compute the distance of every point from the first point (row 0), which produces:
          lat         lng  all_distances
0   39.984198  116.319322       0.000000
12  39.984611  116.319822      62.663690
24  39.984252  116.320826     128.601760
36  39.983916  116.320980     145.036185
48  39.982688  116.321225     233.518640
60  39.981441  116.321305     349.856365
72  39.980291  116.321430     469.693983
What I want instead is to accumulate the distance until it reaches 200 and then start over, with the "first" point replaced by the next point.
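To make the intended reset behaviour concrete, here is a minimal sketch in plain Python. The function name reset_distances, the dist_fn parameter and the one-dimensional toy positions are made up for illustration (they are not part of my real code), and I am assuming that the point which first exceeds the limit becomes the new reference point.

```python
def reset_distances(points, dist_fn, limit):
    """Distance of each point from the current reference point.

    The reference starts at points[0]. As soon as a point lies farther than
    `limit` from the reference, that point becomes the new reference
    (an assumption about the desired behaviour) and its distance is 0.
    """
    if not points:
        return []
    reference = points[0]
    result = []
    for point in points:
        d = dist_fn(reference, point)
        if d > limit:
            reference = point  # restart measuring from this point
            d = 0
        result.append(d)
    return result


# Hypothetical toy data: positions along a straight line, in metres.
toy_positions = [0, 60, 130, 145, 230, 350, 470]
toy_result = reset_distances(toy_positions, lambda a, b: abs(b - a), limit=200)
print(toy_result)  # [0, 60, 130, 145, 0, 120, 0]
```

With real coordinates, dist_fn would be something like lambda a, b: distance(a, b).m using geopy's distance, applied to (lat, lng) tuples.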
Here is a runnable MCVE, so that its running time can be compared with that of vectorized solutions:
import pandas as pd
from geopy.distance import distance

print(pd.__version__)

data = [[39.984198, 116.319322],
        [39.984611, 116.319822],
        [39.984252, 116.320826],
        [39.983916, 116.32098],
        [39.982688, 116.321225],
        [39.981441, 116.321305],
        [39.980291, 116.32143],
        [39.979675, 116.321805],
        [39.979546, 116.322926],
        [39.979758, 116.324513]]
user_gps_log = pd.DataFrame(data, columns=['lat', 'lng'])

# Distance in metres of every point from the first point (row 0).
first_lat = user_gps_log.iloc[0].lat
first_lng = user_gps_log.iloc[0].lng
all_distances = user_gps_log.apply(lambda x: distance((x.lat, x.lng), (first_lat, first_lng)).m, axis=1)
user_gps_log['all_distances'] = all_distances
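p = user_gps_log
i = 0
dist_thres = 2  # threshold in metres
while i < len(p):
    j = i + 1
    while j < len(p):
        dist = distance((p.iloc[i].lat, p.iloc[i].lng), (p.iloc[j].lat, p.iloc[j].lng)).m
        if dist > dist_thres:
            # do stuff
            i = j  # point j becomes the new reference point
            token = 1
            break
        j = j + 1
    else:
        # No remaining point exceeds the threshold: stop, otherwise i never advances.
        break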
EDIT (update)
I tried implementing this with Numba's njit, since I cannot avoid iterating:
from numba import njit

@njit
def cumsum_distance(lat, lng, limit=200):
    running_distance = 0
    first = (lat[0], lng[0])
    for i in range(lat.shape[0]):
        dist = distance(first, (lat[i], lng[i])).m
        running_distance += dist
        if running_distance > limit:
            yield i, running_distance
            running_distance = 0

running_distances = cumsum_distance(user_gps_log.lat.values, user_gps_log.lng.values, 200)
Getting this error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'distance': cannot determine Numba type of <class 'type'>

File "<ipython-input-194-7214618c7e64>", line 6:
def cumsum_distance(lat, lng, limit=200):
    <source elided>
    for i in range(lat.shape[0]):
        dist = distance(first, (lat[i], lng[i])).m
        ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.
Is this because I am using geopy's distance function? Do I need to register a "type" for it, as when using a UDAF in PySpark?
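In case it helps frame the question: my understanding is that nopython mode can only compile functions whose types Numba can infer, which would rule out geopy's distance class. Below is a sketch of what I would expect to compile instead, assuming a haversine (spherical Earth) approximation is accurate enough in place of geopy's geodesic distance. The names haversine_m, distance_with_reset and EARTH_RADIUS_M are made up for this sketch, the function returns an array rather than yielding, and the try/except only lets the code run when Numba is not installed.

```python
import math

import numpy as np

try:
    from numba import njit
except ImportError:  # fall back to plain Python so the sketch still runs
    def njit(func):
        return func

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical approximation)


@njit
def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@njit
def distance_with_reset(lat, lng, limit):
    """Distance of each point from the current reference point.

    Assumption: the first point farther than `limit` from the reference
    becomes the new reference, and its own distance is reported as 0.
    """
    n = lat.shape[0]
    out = np.empty(n, dtype=np.float64)
    anchor = 0
    for i in range(n):
        d = haversine_m(lat[anchor], lng[anchor], lat[i], lng[i])
        if d > limit:
            anchor = i  # restart measuring from this point
            d = 0.0
        out[i] = d
    return out


# Same coordinates as in the MCVE above.
lat = np.array([39.984198, 39.984611, 39.984252, 39.983916, 39.982688,
                39.981441, 39.980291, 39.979675, 39.979546, 39.979758])
lng = np.array([116.319322, 116.319822, 116.320826, 116.32098, 116.321225,
                116.321305, 116.32143, 116.321805, 116.322926, 116.324513])

resets = distance_with_reset(lat, lng, 200.0)
print(resets)
```

The haversine values differ slightly from geopy's geodesic ones, so this only approximates the all_distances column shown earlier.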