
I have a csv dataset that looks like this:

created_date,latitude,longitude
"2018-10-02 16:52:54",20.56314546,-100.40871983
"2018-10-07 18:06:37",20.56899227,-100.40879701
"2018-10-08 11:55:31",20.57479211,-100.39687493
"2018-10-08 11:55:31",20.58076244,-100.36075875
"2018-10-08 11:55:31",20.60529101,-100.40951731
"2018-10-08 11:55:31",20.60783806,-100.37852743
"2018-10-09 18:10:00",20.61098901,-100.38008197
"2018-10-09 18:10:00",20.61148848,-100.40851908
"2018-10-09 18:10:00",20.61327334,-100.34415272
"2018-10-09 18:10:00",20.61397514,-100.33583425

I am trying to use pandas to separate the data into groups by date. I would then like to iterate through every group and calculate the distance between the lat/long pairs in each group using the haversine function, which takes two coordinates as parameters.

To do so I have to calculate the distance from, say, coord1 to coord2, coord2 to coord3, and so on (within each group).

I want to do this in order to calculate the average distance traveled: add the distances together and divide by the number of groups.

With pandas I managed to divide my data into groups, but I'm not sure how to iterate through these groups while excluding the ones (say "2018-10-02 16:52:54") that don't have at least two coords to calculate a distance from.

My current python script looks like this:

import pandas as pd

col_names = ['date', 'latitude', 'longitude']
data = pd.read_csv('dataset.csv', names=col_names, sep=',', skiprows=1)
grouped = data.groupby('date')
for index, item in grouped:
    pass  # not sure what goes here

Any guidance is appreciated. I have a general idea of how to do it, but I'm not sure if tools like zip can help me through this.
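A minimal sketch of that consecutive-pair idea with zip, assuming a standard haversine implementation (defined inline so the snippet runs on its own, and with the sample rows inlined in place of the CSV):

```python
import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    # standard haversine formula, assumed here for illustration; returns km
    lon1, lat1, lon2, lat2 = map(np.deg2rad, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# a subset of the sample data, in place of pd.read_csv('dataset.csv', ...)
data = pd.DataFrame([
    ('2018-10-02 16:52:54', 20.56314546, -100.40871983),
    ('2018-10-08 11:55:31', 20.57479211, -100.39687493),
    ('2018-10-08 11:55:31', 20.58076244, -100.36075875),
], columns=['date', 'latitude', 'longitude'])

avg_per_group = {}
for date, group in data.groupby('date'):
    if len(group) < 2:  # skip groups with a single coordinate
        continue
    coords = list(zip(group.longitude, group.latitude))
    # consecutive pairs: coord1-coord2, coord2-coord3, ...
    dists = [haversine(lon1, lat1, lon2, lat2)
             for (lon1, lat1), (lon2, lat2) in zip(coords, coords[1:])]
    avg_per_group[date] = sum(dists) / len(dists)
```

The `zip(coords, coords[1:])` pairs each coordinate with the next one, and the `len(group) < 2` check drops singleton groups like the "2018-10-02 16:52:54" row.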

Luis
  • Use groupby with a lambda, and define a function that computes the distance. – mad_ Oct 15 '18 at 16:29
  • @mad_'s comment is the correct approach. If you could provide your function definition (even a crude definition) w/ code, then we can help you fine-tune the solution. – rahlf23 Oct 15 '18 at 16:44

1 Answer


Here's one option. It performs a merge within groups, producing every pairwise combination of rows. Then remove the self-pairs (rows merged with themselves) and calculate the distances.

import pandas as pd
import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    # convert degrees to radians 
    lon1 = np.deg2rad(lon1)
    lat1 = np.deg2rad(lat1)
    lon2 = np.deg2rad(lon2)
    lat2 = np.deg2rad(lat2)

    # formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r_e = 6371  # mean Earth radius in km
    return c * r_e

Code (this assumes the CSV was read with its original header, so the date column is named created_date):

# merge
m = df.reset_index().merge(df.reset_index(), on='created_date')

# remove comparisons of the same event
m = m[m.index_x != m.index_y].drop(columns = ['index_x', 'index_y'])

# Calculate Distance
m['Distance'] = haversine(m.longitude_x, m.latitude_x, m.longitude_y, m.latitude_y)

Output: m

           created_date  latitude_x  longitude_x  latitude_y  longitude_y  Distance
3   2018-10-08 11:55:31   20.574792  -100.396875   20.580762  -100.360759  3.817865
4   2018-10-08 11:55:31   20.574792  -100.396875   20.605291  -100.409517  3.637698
5   2018-10-08 11:55:31   20.574792  -100.396875   20.607838  -100.378527  4.141211
...
30  2018-10-09 18:10:00   20.613975  -100.335834   20.610989  -100.380082  4.617105
31  2018-10-09 18:10:00   20.613975  -100.335834   20.611488  -100.408519  7.569825
32  2018-10-09 18:10:00   20.613975  -100.335834   20.613273  -100.344153  0.869261

To get the average per date:

m.groupby('created_date').Distance.mean()

#created_date
#2018-10-08 11:55:31    4.021623
#2018-10-09 18:10:00    4.411060
#Name: Distance, dtype: float64

Because we subset the merged DataFrame above, this only produces output for created_dates with more than one measurement.
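The question also asked for a single overall figure, the per-date averages summed and divided by the number of groups, which is just the mean of that Series. A self-contained sketch using two of the coordinate pairs from the sample data (haversine repeated so it runs standalone):

```python
import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    # same haversine as above, repeated so this snippet runs on its own
    lon1, lat1, lon2, lat2 = map(np.deg2rad, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    'created_date': ['2018-10-08 11:55:31', '2018-10-08 11:55:31',
                     '2018-10-09 18:10:00', '2018-10-09 18:10:00'],
    'latitude': [20.57479211, 20.58076244, 20.61327334, 20.61397514],
    'longitude': [-100.39687493, -100.36075875, -100.34415272, -100.33583425],
})

# pairwise merge within each created_date, dropping self-pairs as above
m = df.reset_index().merge(df.reset_index(), on='created_date')
m = m[m.index_x != m.index_y].drop(columns=['index_x', 'index_y'])
m['Distance'] = haversine(m.longitude_x, m.latitude_x, m.longitude_y, m.latitude_y)

per_date = m.groupby('created_date').Distance.mean()
overall = per_date.mean()  # sum of group averages / number of groups
```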


To merge on date instead of an exact time:

df['created_date'] = pd.to_datetime(df.created_date)
df['ng'] = df.groupby(df.created_date.dt.date).ngroup()

m = df.reset_index().merge(df.reset_index(), on='ng')
m = m[m.index_x != m.index_y].drop(columns = ['index_x', 'index_y'])

...
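Continuing that date-based merge, the per-day average can then come from the date part of created_date_x. A runnable sketch with two timestamps on the same day (the Date column name is an assumption, following the suggestion in the comments; haversine repeated so it runs standalone):

```python
import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    # same haversine as above, repeated so this snippet runs on its own
    lon1, lat1, lon2, lat2 = map(np.deg2rad, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# two different timestamps on the same calendar day
df = pd.DataFrame({
    'created_date': ['2018-10-08 11:55:31', '2018-10-08 13:00:00'],
    'latitude': [20.57479211, 20.58076244],
    'longitude': [-100.39687493, -100.36075875],
})
df['created_date'] = pd.to_datetime(df.created_date)
df['ng'] = df.groupby(df.created_date.dt.date).ngroup()

m = df.reset_index().merge(df.reset_index(), on='ng')
m = m[m.index_x != m.index_y].drop(columns=['index_x', 'index_y'])
m['Distance'] = haversine(m.longitude_x, m.latitude_x, m.longitude_y, m.latitude_y)

# per-day average, keyed by the date part
m['Date'] = m.created_date_x.dt.date
daily = m.groupby('Date').Distance.mean()
```

The ng group number ties both timestamps to the same day, so the merge now pairs points across timestamps within a day, not just within an exact timestamp.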
ALollz
  • Thanks, that helped me a lot. However, I can't seem to get a daily average if I don't separate my data by day instead of by timestamp. Any idea on how I would do this? I imagine I'd have to create a new index where the merge is done? – Luis Oct 15 '18 at 20:07
  • @Luis To merge based on `date` and not the timestamp, I'd first convert to a datetime: `df['created_date'] = pd.to_datetime(df.created_date)` and then you can create a group number: `df['ng'] = df.groupby(df.created_date.dt.date).ngroup()`, and then you should be able to just merge `on='ng'` following exactly the steps above. Make sure to reset the index on the merge, and then drop where the indices are equal after. – ALollz Oct 15 '18 at 20:19
  • You may consider defining a new 'Date' column, which is just the date component of either `created_date_x` or `created_date_y` – ALollz Oct 15 '18 at 20:28
  • @ALollz is this in miles or kms? Or do I need another function to convert the output from this to one of those measurements? Preferably miles? – Emm Sep 20 '20 at 09:40