
I am new to Python. I have a dataset of people's travel records in each time period, and I would like to build a new dataframe that describes the choice set for each person when she travels.

I am trying to find all the stations within 5 km of a person, using lat/lon coordinates. My dataframe contains the person id, the person's location at time t, and the station coordinates. I would like a new dataframe, indexed by person id and time t, that lists every station within 5 km of the person that has appeared in the dataset, with the distance to each station as another column.

A station stays available once it has appeared. For example, if station 1 appears in period 1 but not in period 2, it still exists; it just was not in anyone's travel records in period 2. The result describes each person's choice set at time t (for example, the stations she considers when getting gas for her car): a person can move, but a station is always in the choice set after it is built. The same holds for people: although B did not go to any station at time 2, she still has a choice set of 5 and 6, because once a person has appeared she is assumed to always be there. That is why B shows up again at time 2.

import geopy.distance
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({
    'time': [1, 1, 2, 2],
    'personid': ['A', 'B', 'A', 'C'],
    'station': [5, 6, 7, 5],
    'stationLoc': [(122.286, 114.135), (122.284, 114.131), (122.286, 114.224), (122.286, 114.135)],
    'personLoc': [(122.283, 114.127), (122.283, 114.127), (122.286, 114.219), (122.286, 114.224)],
})

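What I expect to get looks like this:

# 'Ato5' etc. are placeholder labels: each stands for the distance from that person to that station
df1 = pd.DataFrame({
    'personid': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C'],
    'time': [1, 1, 2, 2, 2, 1, 1, 2, 2, 2],
    'stations_within_5km': [5, 6, 5, 6, 7, 5, 6, 5, 6, 7],
    'distance': ['Ato5', 'Ato6', 'Ato5', 'Ato6', 'Ato7', 'Bto5', 'Bto6', 'Bto5', 'Bto6', 'Cto7'],
})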

I have tried using a loop, but I find it hard to turn this idea into a standardized data format that I can run a regression on. Sonia's answer is great, but it was posted before I had stated the problem clearly. Sorry about that; I still appreciate it.

This is written in Python, but if R would work better, R code is also welcome. Any thoughts would be appreciated.

Thank you very much!

Henry Ecker
Jacob2309
  • You mentioned "all the stations that are within 5 km of the person at AND before time t". In that case, are you looking for two distances, one before t and one at t? The question and the given output do not match. Where does the output show "distance before t" and "distance after t"? – Sonia Samipillai Sep 04 '21 at 14:54
  • Hi Sonia. Thank you so much for your kind reply. I found that my previous description would give a result that misses some values. Sorry for the incorrect description. Regarding distance, I mean that all stations, once they have appeared, are included in the consideration set, even if a station is not shown in the original travel records at time t. So there would not be two distances to the same station before and after t, as long as the person stays in the same place. Thank you Sonia. ;) – Jacob2309 Sep 04 '21 at 15:00
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Sep 08 '21 at 19:24

1 Answer

Use the haversine distance to compute the distance between two coordinates. The haversine formula is given below.
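For reference, here is the formula in symbols. This is a restatement of what the function computes, using its own names: φ is latitude and λ is longitude (both in radians), Δφ and Δλ are the differences between the two points, and R is the Earth's radius (6371 km in the code).

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c
\end{aligned}
```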

import math

def haversine_distance(point1, point2):
    '''
    Takes point1 and point2 and calculates the haversine distance 
    between the two points.
    
    Input Parameters
    ----------------
    point1 and point2 as latitude and longitude coordinates in tuples
    
    For e.g.,
    
    point1_coords = (49.012798, 2.550000)
    point2_coords = (-43.489399, 172.531998)
    
    Output Parameters
    -----------------
    Distance in Km
    
    Notes
    -----
    The haversine (great-circle) distance is the shortest distance
    between two points measured along the surface of a sphere.
    
    '''
    lat1,lon1 = point1
    lat2,lon2 = point2

    # φ1, φ2 are the latitude of point 1 and latitude of point 2 (in radians)
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    # λ1, λ2 are the longitude of point 1 and longitude of point 2 (in radians).
    lambda1 = math.radians(lon1)
    lambda2 = math.radians(lon2)

    delta_phi = phi2-phi1
    delta_lambda = lambda2-lambda1

    # Calculating a
    a = math.sin(delta_phi/2.0)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(delta_lambda/2.0)**2
    
    # Calculating c
    c = 2*math.atan2(math.sqrt(a),math.sqrt(1-a))

    R = 6371  # radius of Earth in kilometers
    
    # Calculating Distance
    d = R*c # d is the distance
    
    return d
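If the data grows, calling the function once per pair in pure Python may become slow. A minimal vectorized sketch with NumPy is shown here, assuming latitudes and longitudes are passed as separate scalars or arrays in degrees. The name `haversine_km` and the `radius_km` parameter are made up for this illustration; the formula is the same as above.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in km; inputs in degrees, scalars or NumPy arrays."""
    phi1, lam1, phi2, lam2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((phi2 - phi1) / 2.0) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2.0) ** 2)
    # Guard against tiny floating-point overshoot outside [0, 1]
    a = np.clip(a, 0.0, 1.0)
    c = 2.0 * np.arctan2(np.sqrt(a), np.sqrt(1.0 - a))
    return radius_km * c

# One person against several stations at once (broadcasting);
# the coordinates are the first-tuple/second-tuple values from the question.
station_first = np.array([122.286, 122.284, 122.286])
station_second = np.array([114.135, 114.131, 114.224])
dists = haversine_km(122.283, 114.127, station_first, station_second)
within_5km = dists <= 5.0   # boolean mask, one entry per station
```

The boolean mask can then be used to select the station ids that fall inside the radius.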

Then calculate the distance between each person's location and all the station locations, as shown below:

df['uniq_id'] = df['time'].astype(str) + df['personid']  # unique key per (time, person)

def stations_within_5km(uniqid_arr, personloc_arr, stationloc_arr, station_arr):
    distance = {}     # uniq_id -> list of distances (km)
    stations_5 = {}   # uniq_id -> list of station ids within 5 km
    for i in range(len(personloc_arr)):
        dist = []
        stations_ = []
        for j in range(len(stationloc_arr)):
            # Compute each person-to-station distance once and reuse it
            d = haversine_distance(personloc_arr.iloc[i], stationloc_arr.iloc[j])
            # Keep stations within 5 km, skipping stations already recorded
            if d <= 5 and station_arr.iloc[j] not in stations_:
                dist.append(d)
                stations_.append(station_arr.iloc[j])

        distance[uniqid_arr.iloc[i]] = dist
        stations_5[uniqid_arr.iloc[i]] = stations_
    return distance, stations_5

distance, stations_5 = stations_within_5km(df['uniq_id'], df['personLoc'], df['stationLoc'], df['station'])
df['stations_within_5km'] = [stations_5[id_] for id_ in df['uniq_id']]
df['distance'] = [distance[id_] for id_ in df['uniq_id']]

Output:

   time  personid  station          stationLoc           personLoc  uniq_id  stations_within_5km                                  distance
0     1         A        5  (122.286, 114.135)  (122.283, 114.127)       1A               [5, 6]  [0.580544416674241, 0.26229648732200417]
1     1         B        6  (122.284, 114.131)  (122.283, 114.127)       1B               [5, 6]  [0.580544416674241, 0.26229648732200417]
2     2         A        7  (122.286, 114.224)  (122.286, 114.219)       2A               [5, 7]    [4.98912110882879, 0.2969715135144607]
3     2         C        5  (122.286, 114.135)  (122.286, 114.224)       2C                  [7]                                     [0.0]

The calculated columns show the stations (duplicates removed) within 5 km of each person and the respective distances.
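The output above stores the stations as lists, one row per record. A minimal sketch of the long format the question asks for (one row per person, time and station) follows. It assumes, per the question, that a station stays available once it has appeared and that a person keeps her last recorded location in periods where she has no record. `build_choice_sets` and `max_km` are names made up for this sketch; `haversine_distance` is repeated in compact form so the block runs on its own.

```python
import math
import pandas as pd

def haversine_distance(point1, point2):
    """Haversine distance in km between two coordinate tuples in degrees."""
    lat1, lon1 = point1
    lat2, lon2 = point2
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 6371 * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

df = pd.DataFrame({
    'time': [1, 1, 2, 2],
    'personid': ['A', 'B', 'A', 'C'],
    'station': [5, 6, 7, 5],
    'stationLoc': [(122.286, 114.135), (122.284, 114.131), (122.286, 114.224), (122.286, 114.135)],
    'personLoc': [(122.283, 114.127), (122.283, 114.127), (122.286, 114.219), (122.286, 114.224)],
})

def build_choice_sets(df, max_km=5.0):
    cols = ['personid', 'time', 'stations_within_5km', 'distance']
    rows = []
    stations = {}   # station id -> location, for every station seen so far
    last_loc = {}   # person id -> most recent recorded location
    for t, grp in df.groupby('time'):   # groups come in ascending time order
        # Assumption: stations and people persist once they have appeared
        stations.update(zip(grp['station'], grp['stationLoc']))
        last_loc.update(zip(grp['personid'], grp['personLoc']))
        for pid, ploc in last_loc.items():
            for sid, sloc in stations.items():
                d = haversine_distance(ploc, sloc)
                if d <= max_km:
                    rows.append(dict(zip(cols, (pid, t, sid, d))))
    out = pd.DataFrame(rows, columns=cols)
    return out.sort_values(cols[:3]).reset_index(drop=True)

choice_sets = build_choice_sets(df)
print(choice_sets)
```

On the sample data this gives 9 rows. A at time 2 gets stations 5 and 7 rather than 5, 6 and 7, because station 6 is slightly more than 5 km from A's new location, consistent with the `2A` row in the output above. Calling `choice_sets.set_index(['personid', 'time'])` gives the two-level index mentioned in the question.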

Sonia Samipillai
  • Hello Sonia. Thank you for your reply. It looks nice. However, may I know how to create this uniq_id based on time and personid? If the data size increases, we might not be able to combine them manually. Thanks. – Jacob2309 Sep 04 '21 at 11:26
  • Simply combine time and personid: df['uniq_id'] = df['time'].astype(str) + df['personid'] – Sonia Samipillai Sep 04 '21 at 11:30