I am new to Python. I have a dataset of people's travel records in each time period and would like to build a new dataframe that describes the choice set for each person when she travels.
I am trying to find all the stations within 5 km of a person, using lat/lon coordinates. My dataframe contains the person ID, the person's location at time t, and the coordinates of the station she visited. I would like a new dataframe, indexed by person ID and time t, that lists every station within 5 km of the person among the stations that have appeared in the dataset, with the distance to each of them as another column.
Two rules define this choice set (for example, the set of stations a person considers when getting gas for her car). First, a station stays available once it is built: if a station has appeared in period 1 but not in period 2, it is still there; it just was not in anyone's travel records in period 2. Second, a person stays in the data once she has appeared: although B did not go to any station at time 2, she still has a choice set of 5 and 6, which is why B shows up again at time 2. A person can move between periods, but a station remains in the choice set.
import geopy.distance
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform
df = pd.DataFrame({
    'time': [1, 1, 2, 2],
    'personid': ['A', 'B', 'A', 'C'],
    'station': [5, 6, 7, 5],
    'stationLoc': [(122.286, 114.135), (122.284, 114.131), (122.286, 114.224), (122.286, 114.135)],
    'personLoc': [(122.283, 114.127), (122.283, 114.127), (122.286, 114.219), (122.286, 114.224)],
})
The output I expect looks like this:
df1 = pd.DataFrame({
    'personid': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C'],
    'time': [1, 1, 2, 2, 2, 1, 1, 2, 2, 2],
    'stations_within_5km': [5, 6, 5, 6, 7, 5, 6, 5, 6, 7],
    # Placeholders for the computed distances, e.g. 'Ato5' = distance from A to station 5
    'distance': ['Ato5', 'Ato6', 'Ato5', 'Ato6', 'Ato7', 'Bto5', 'Bto6', 'Bto5', 'Bto6', 'Cto7'],
})
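To make the rules above concrete, here is a rough sketch of the logic as a cross join followed by a distance filter. It is not a definitive solution, and several things in it are assumptions. The coordinates are made-up (lat, lon) values, because the sample tuples above fall outside the valid latitude range of -90 to 90 degrees, so its output is not meant to reproduce the table above. The column names (`st_lat`, `st_lon`, `p_lat`, `p_lon`, `built`) and the `haversine_km` helper are hypothetical. A plain haversine formula stands in for `geopy.distance`. A station is treated as built in the first period it appears, and a person who has no record in a period is assumed to stay at her last known location.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or aligned arrays."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical records with valid (lat, lon) values; NOT the data from the question.
records = pd.DataFrame({
    'time': [1, 1, 2, 2],
    'personid': ['A', 'B', 'A', 'C'],
    'station': [5, 6, 7, 5],
    'st_lat': [22.286, 22.284, 22.286, 22.286],
    'st_lon': [114.135, 114.131, 114.224, 114.135],
    'p_lat': [22.283, 22.283, 22.286, 22.286],
    'p_lon': [114.127, 114.127, 114.219, 114.224],
})

# One row per station: assumed "built" in the first period it appears.
stations = records.groupby('station', as_index=False).agg(
    built=('time', 'min'),
    st_lat=('st_lat', 'first'),
    st_lon=('st_lon', 'first'),
)

# One row per (person, period) from her first appearance onward.
# Assumption: a person with no record in a period keeps her last known location.
periods = sorted(records['time'].unique())
locs = (records.drop_duplicates(['personid', 'time'])
        .set_index(['personid', 'time'])[['p_lat', 'p_lon']])
grid = pd.MultiIndex.from_product(
    [records['personid'].unique(), periods], names=['personid', 'time'])
people = (locs.reindex(grid)
          .groupby(level='personid').ffill()  # carry location forward
          .dropna()                           # drop periods before first appearance
          .reset_index())

# Pair every person-period with every station, keep stations already built,
# then compute distances and keep those within 5 km.
pairs = people.merge(stations, how='cross')  # requires pandas >= 1.2
pairs = pairs[pairs['built'] <= pairs['time']].copy()
pairs['distance_km'] = haversine_km(
    pairs['p_lat'], pairs['p_lon'], pairs['st_lat'], pairs['st_lon'])

choice_set = (pairs[pairs['distance_km'] <= 5]
              .sort_values(['personid', 'time', 'station'])
              .set_index(['personid', 'time'])[['station', 'distance_km']])
print(choice_set)
```

The cross join grows as (person-periods) x (stations), so this is only practical for moderately sized data; for larger data a spatial index would probably be needed before computing distances.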
I have tried using a loop, but I find it hard to turn this idea into a standardized data format that I can run a regression on. Sonia's answer is great, but it was posted before I had stated the problem clearly. Sorry about that; I still appreciate it.
This is written in Python, but R code would also be welcome if R works better. Any thoughts would be appreciated.
Thank you very much!