I need help writing efficient Python code (using pandas) to find which vehicle passed closest to incident_sw = (35.7158, -120.7640), and at what time. I'm having trouble formulating a Euclidean distance to sort through the df below and print the vehicle and corresponding time that are closest to incident_sw. All times are HH:MM:SS.SS (assume the times below are in hour 12).

My time conversion function--

def time_convert(str_time):
    values = str_time.split(':')
    mins = 60*(float(values[0]) - 12) + float(values[1]) + 1.0/60 * float(values[2])
    mins = round(mins, 4)
    return mins

My csv dataframe--

vehicle time    lat[D.DDD]  lon[D.DDD]
veh_1   17:19.5 35.7167809  -120.7645652
veh_1   17:19.5 35.7167808  -120.7645652
veh_1   17:19.7 35.7167811  -120.7645648
veh_1   17:20.1 35.7167812  -120.7645652
veh_2   17:20.4 35.7167813  -120.7645647
veh_2   17:20.7 35.7167813  -120.7645646
veh_3   17:22.6 35.7167807  -120.7645651
veh_3   17:23.4 35.7167808  -120.7645652
veh_4   17:24.1 35.7167803  -120.7645653
veh_4   17:25.0 35.7167806  -120.7645658
veh_5   17:25.9 35.7167798  -120.7645659
veh_5   17:26.6 35.7167799  -120.7645658
Cœur
CatLady
  • so you want to find a better way to calculate (lat, long) distance? – linpingta Nov 22 '16 at 04:48
  • @linpingta That is one part of it. I specifically need to formulate a code to return which vehicle at what time passed closest to incident_sw =(35.7158, -120.7640), which is a separate variable from the csv data. – CatLady Nov 22 '16 at 04:51
  • So time is an input parameter, and you want veh_id as the function output? If you need an exact time match, that means a filter operation on the dataframe, is that right? – linpingta Nov 22 '16 at 04:58
  • First, I need to calculate distance between all lat/lon in csv df compared to incident_sw lat/lon. Output is both veh_id and time for the lat/lon that is closest in distance to incident_sw – CatLady Nov 22 '16 at 05:08
  • This question might help: http://stackoverflow.com/q/38082936/3765319 – Kartik Nov 22 '16 at 05:37
  • @Kartik, I have a hard time understanding how to apply vectorization. Trying to keep it as simple as possible. – CatLady Nov 22 '16 at 05:53
  • Why is it hard? It is quite simple. The alternative, IMHO is harder, because you have to code a loop yourself, which just becomes ugly. Think of [`np.vectorize`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html) as a loop. It will just loop over whatever you pass, repeatedly calling the function inside. – Kartik Nov 22 '16 at 06:19
  • The concept seems easier; learning how to implement it is another! Still learning the basics. I think it was that question you referenced...it was a confusing example to me. – CatLady Nov 22 '16 at 06:33

1 Answer

So, at the outset, I would recommend you use a library like Geopy to do the heavy lifting of calculating the distances between points. Secondly, I would recommend using GeoPandas to store geographic information. More on that later.

Assuming your distance function is called distance (whether you code it yourself or take it from Geopy), this will speed things up somewhat. Note that the implementation below is still a loop, even though it uses vectorize from the NumPy library. Also, it is pseudo-code, and you will have to modify it to work for you.

import numpy as np

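def dist_calc(list_of_points, point):
    # transform passes each group as the first positional argument,
    # so the fixed reference point has to come second.
    dist = np.vectorize(lambda x: distance(point, x))
    return dist(list_of_points)

# Now you can call it simply using (columns assumed renamed to 'lat' and 'lon'):
df['points'] = list(zip(df['lat'], df['lon']))
df.groupby('vehicle')['points'].transform(dist_calc, point=incident_sw)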

The reason for recommending GeoPandas is simple. If you have a huge number of points to search, say each vehicle leaves a trail of points every minute or second, then the above answer will take a long time to compute. If you store your data in a GeoPandas GeoDataFrame, you can use the buffer and intersects tools in GeoPandas to limit the search space around your incidents. In that case, you build a reasonably sized buffer around each incident and only search the vehicle points that fall inside that buffer. That will help speed up your code.
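The idea of limiting the search space can be illustrated without GeoPandas. The sketch below swaps in a plain pandas bounding-box filter for the buffer and intersects tools: it keeps only the rows within a fixed number of degrees of the incident before any distance is computed. The helper name near_incident, the 0.002-degree half-width, and the sample rows (including the made-up veh_far point) are hypothetical, and the columns are assumed to be renamed to lat and lon.

```python
import pandas as pd

def near_incident(df, incident, half_width_deg=0.002):
    """Keep rows whose lat/lon fall inside a box centred on the incident."""
    lat0, lon0 = incident
    in_lat = df["lat"].between(lat0 - half_width_deg, lat0 + half_width_deg)
    in_lon = df["lon"].between(lon0 - half_width_deg, lon0 + half_width_deg)
    return df[in_lat & in_lon]

incident_sw = (35.7158, -120.7640)

points = pd.DataFrame({
    "vehicle": ["veh_1", "veh_2", "veh_far"],  # veh_far is an invented distant point
    "lat": [35.7167809, 35.7167813, 35.9000000],
    "lon": [-120.7645652, -120.7645647, -120.5000000],
})

# Only the rows inside the box go on to the (more expensive) distance step.
candidates = near_incident(points, incident_sw)
print(candidates["vehicle"].tolist())
```

A box in degrees is not square in metres (a degree of longitude shrinks with latitude), so this is only a coarse prefilter; an exact distance still has to be computed on the surviving rows.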

I would recommend you take a day to familiarize yourself with all the capabilities of GeoPandas before proceeding.


Using great_circle from geopy

from geopy.distance import great_circle
import numpy as np

def dist_calc(list_of_points, point):
    # transform passes each group as the first positional argument,
    # so the fixed reference point has to come second.
    dist = np.vectorize(lambda x: great_circle(point, x).meters)
    return dist(list_of_points)

# Now you can call it simply using (columns assumed renamed to 'lat' and 'lon'):
df['points'] = list(zip(df['lat'], df['lon']))
df['distances'] = df.groupby('vehicle')['points'].transform(dist_calc, point=incident_sw)
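To get from a distances column to the actual answer (which vehicle, at what time), one option is to compute a haversine distance with NumPy on whole columns and then pick the row with the smallest value via idxmin. This is a minimal sketch rather than the code above: the haversine_m helper, the Earth radius constant, the renamed lat and lon columns, and the choice of three sample rows copied from the question are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; lat2/lon2 may be scalars or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

incident_sw = (35.7158, -120.7640)

# Three rows from the question, with the columns renamed to 'lat' and 'lon'.
df = pd.DataFrame({
    "vehicle": ["veh_1", "veh_3", "veh_5"],
    "time": ["17:19.5", "17:22.6", "17:25.9"],
    "lat": [35.7167809, 35.7167807, 35.7167798],
    "lon": [-120.7645652, -120.7645651, -120.7645659],
})

# One vectorized call covers every row; no groupby or Python-level loop needed.
df["distance_m"] = haversine_m(incident_sw[0], incident_sw[1], df["lat"], df["lon"])

# idxmin returns the index label of the smallest distance.
closest = df.loc[df["distance_m"].idxmin()]
print(closest["vehicle"], closest["time"], round(closest["distance_m"], 1))
```

Because the incident is a single fixed point, grouping by vehicle is not needed to find the overall closest pass; a groupby would only matter if the closest point per vehicle were wanted.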
Kartik
  • the incident_sw is only that one given lat/lon and not a column in another df. I coded a distance formula, but the inputs are lat2, lon2. Lat1 and lon1 are defined with the incident_sw values. Are you creating a 'points' column in the above? Not sure where that is coming from. – CatLady Nov 22 '16 at 06:29
  • You did not mention that you had a distance function. You actually said that you had trouble formulating the Euclidean distance computation. Hence I recommended using Geopy. But then a simple fix would be to change your distance function to accept `tuples`. Something like: `def myfunc(point2, point1): (lat2, lon2), (lat1, lon1) = point2, point1` should do the trick... – Kartik Nov 22 '16 at 08:05
  • The reason it is better to pass the latitude and longitude pairs as `tuples` is because `tuples` are not mutable. The two values will stay together. Especially, if you want to use `np.vectorize` to move things along a bit faster, then it is important to pass them as a list or array of `tuples`, else, your values may be split. In any case, handling code with fewer variables, IMHO, is more maintainable and easier. – Kartik Nov 22 '16 at 08:11
  • I decided to import geopy for great circle distance. Not sure how your pseudocode would work with geopy as the distance formula. Also, I'm getting a type error with regard to the use of "point": TypeError: dist_calc() got multiple values for argument 'point' – CatLady Nov 22 '16 at 10:04
  • See the edit in the answer. It should be pretty clear how to go from here. – Kartik Nov 22 '16 at 21:02
  • The `incident_sw` should be a `tuple` of (lat, lon) when you send it to `dist_calc`. Like in your question. That should not raise a `multiple value` error. – Kartik Nov 22 '16 at 21:04
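
The tuple-based signature suggested in the comments could look like the sketch below. It uses only the standard library; the function name great_circle_m and the Earth radius value are assumptions, not part of geopy or of the asker's own code.

```python
import math

def great_circle_m(point2, point1):
    """Haversine distance in metres between two (lat, lon) tuples in degrees."""
    # Unpack each tuple separately so lat and lon stay paired.
    (lat2, lon2), (lat1, lon1) = point2, point1
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))  # assumed mean Earth radius

incident_sw = (35.7158, -120.7640)
first_row = (35.7167809, -120.7645652)  # first veh_1 row from the question
print(great_circle_m(first_row, incident_sw))
```

A function with this shape can be passed wherever the answer uses distance(point, x), since both arguments are plain (lat, lon) tuples.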