I have a data frame that contains a group ID, two distance measures (longitude/latitude type measure), and a value. For a given set of distances, I want to find the number of other groups nearby, and the average values of those other groups nearby.
I've written the following code, but it is so inefficient that it simply does not complete in a reasonable time for very large data sets. The calculation of nearby retailers is quick. But the calculation of the average value of nearby retailers is extremely slow. Is there a better way to make this more efficient?
distances = [1,2]
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)),
columns=['Group','Dist1','Dist2','Value'])
# get one row per group, with the two distances for each row
df_groups = df.groupby('Group')[['Dist1','Dist2']].mean()
# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])
# find points within a given radius
for i in distances:
closeby = tree.query_ball_tree(tree, r=i)
# put into density column
df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
# get average values of nearby groups
for idx, val in enumerate(df_groups.index):
val_idx = df_groups.iloc[closeby[idx]].index.values
mean = df.loc[df['Group'].isin(val_idx), 'Value'].mean()
df_groups.loc[val, str(i) + '_mean_values'] = mean
# merge back to dataframe
df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
str(i) + '_mean_values']],
left_on='Group',
right_index=True)