I have a very large DataFrame with coordinates. Let's take the following example:

import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Buyer': 'Carl Mark Carl Joe Mark Carl'.split(),
    'Quantity': [5, 2, 5, 10, 1, 5],
    'Lat': [50.111, 48.777, 50.111, 52.523, 48.777, 50.111],
    'Lng': [8.6805, 9.1807, 8.6805, 13.411, 9.1807, 8.6805],
    'Date': [
        DT.datetime(2013, 1, 1, 13, 0),
        DT.datetime(2013, 1, 1, 13, 5),
        DT.datetime(2013, 1, 1, 20, 0),
        DT.datetime(2013, 2, 6, 10, 0),
        DT.datetime(2013, 2, 6, 12, 0),
        DT.datetime(2013, 2, 6, 14, 0),
    ]})

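import geopy
import geopy.distance  # imported explicitly so geopy.distance.distance is available below

# One geopy.Point per row, built from the Lat/Lng columns.
df['Point'] = df.apply(lambda row: geopy.Point(row['Lat'], row['Lng']), axis=1)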

Based on this DataFrame, I need to calculate distances between points many times. The points being compared are often the same, for example when I calculate the distance from Carl to all other Buyers for each day.

def dis_calc(df):
    # Reference point: Carl's location.
    p = geopy.Point(50.111, 8.6805)
    total = 0.0  # named total to avoid shadowing the built-in sum()
    for i, row in df.iterrows():
        dist = geopy.distance.distance(p, row['Point']).km
        total = total + dist
    return total


gr = df.groupby(df.Date.map(lambda d: d.date()))
gr.apply(dis_calc)

To do this efficiently, without calculating the same distances repeatedly, I am hoping to build an adjacency matrix of the Buyers and their distances to each other. I could then query this matrix instead of redoing the distance calculations.

Something like the following:

     | Carl  | Mark  | Joe
-----------------------------
Carl | 10 km | 5 km  | 10 km
Mark |       | 20 km | 15 km
Joe  |       |       | 25 km
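
To make the idea concrete, here is a minimal sketch of such a matrix as a nested dict keyed by Buyer. It assumes each Buyer has a single fixed location (which holds for the example DataFrame above). To keep the snippet standard-library only, it swaps in a plain haversine great-circle formula with a 6371 km Earth radius in place of geopy's distance, so the numbers will differ slightly from geopy's. The names haversine_km and build_distance_matrix are hypothetical, chosen only for illustration.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of the haversine model


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def build_distance_matrix(locations):
    """Return {buyer: {other: km}}; each unordered pair is computed once and mirrored."""
    matrix = {name: {name: 0.0} for name in locations}
    for a, b in combinations(locations, 2):
        d = haversine_km(locations[a], locations[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Coordinates taken from the example DataFrame above.
locations = {
    'Carl': (50.111, 8.6805),
    'Mark': (48.777, 9.1807),
    'Joe': (52.523, 13.411),
}
dist = build_distance_matrix(locations)

# A lookup is now two dict accesses instead of a distance computation.
print(round(dist['Carl']['Mark'], 1))
```

With n Buyers this stores n * n entries but computes only n * (n - 1) / 2 distances, which may or may not be acceptable depending on how many distinct Buyers there are.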

What data structure would you recommend for this adjacency matrix, and how would you implement the lookup so that it is faster than computing each distance on demand?

I would deeply appreciate any help.

Andy

  • It looks like you want an adjacency matrix for each day. I suggest a dictionary in which keys are dates and values are DataFrames, where each axis lists the customers. Alternatively, you could look into pandas' Panel objects. – Dan Allan May 29 '13 at 17:51
  • Hi Dan, I am afraid you misunderstood me. My idea is rather to have an adjacency matrix between the Buyers that holds, in each cell, the distance between them. I updated my problem description. I would greatly appreciate your opinion on that. – Andy May 29 '13 at 20:57
  • Yes, what you've shown is what I meant by "DataFrames, where each axis lists the customers." – Dan Allan May 30 '13 at 12:13
