I want to optimize a piece of code that calculates the nearest neighbour for every item in a dataset with 100k rows. The dataset contains 50 variable columns that describe each row item, and most cells contain a probability value between 0 and 1.
Question: I am fairly new to Python, but I would like to know if more advanced users can recommend a better structure for the code below that would help me speed up this calculation. Currently, the program takes a very long time to complete. Thanks in advance!
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
from sklearn.neighbors import KNeighborsRegressor  # currently unused
df_set = pd.read_excel('input.xlsx', skiprows=0)  # skiprows=0 is the default
distance_columns = ["var_1",
                    ......,
                    ......,
                    ......,
                    "var_50"]
def euclidean_distance(row, selected_row):
    # Per-row Euclidean distance over all 50 distance columns.
    # Unused below: the loop calls scipy's distance.euclidean instead.
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_row[k]) ** 2
    return math.sqrt(inner_value)
knn_name_list = []  # nearest-neighbour Filename for each row
for i in range(len(df_set.index)):
    # z-score normalization of the 50 columns (recomputed on every iteration)
    numeric = df_set[distance_columns]
    normalized = (numeric - numeric.mean()) / numeric.std()
    normalized.fillna(0, inplace=True)
    # normalized values of the current row; .iloc[0] turns the one-row
    # selection into a 1-D Series, which is what scipy's euclidean expects
    selected_normalized = normalized[df_set["Filename"] == df_set["Filename"][i]].iloc[0]
    # distance from every row to the selected row
    euclidean_distances = normalized.apply(lambda row: distance.euclidean(row, selected_normalized), axis=1)
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    # the smallest distance is the row itself (0), so take the second smallest
    second_smallest = distance_frame.iloc[1]["idx"]
    most_similar_to_selected = df_set.loc[int(second_smallest)]["Filename"]
    knn_name_list.append(most_similar_to_selected)
print(knn_name_list)
df_set['top_neighbor'] = np.array(knn_name_list)  # attach result column
df_set.to_csv('output.csv', encoding='utf-8', sep=',', index=False)
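
For what it's worth, I imported KNeighborsRegressor above because I suspect scikit-learn can do this kind of lookup much faster than my loop, but I could not work out how to apply it. Below is a rough, untested sketch of what I have in mind using sklearn.neighbors.NearestNeighbors (I am assuming that each row's closest match in the fitted data is the row itself at distance 0, so the second neighbour is the one I actually want); corrections welcome if this is not equivalent to my loop:

import pandas as pd
from sklearn.neighbors import NearestNeighbors

df_set = pd.read_excel('input.xlsx')
numeric = df_set[distance_columns]  # same 50 columns as defined above

# normalize once, outside any loop
normalized = ((numeric - numeric.mean()) / numeric.std()).fillna(0)

# ask for the 2 closest rows per row: column 0 of `indices` should be the
# row itself, so column 1 is the nearest other row
nn = NearestNeighbors(n_neighbors=2, metric='euclidean')
nn.fit(normalized.values)
distances, indices = nn.kneighbors(normalized.values)

df_set['top_neighbor'] = df_set['Filename'].values[indices[:, 1]]
df_set.to_csv('output.csv', encoding='utf-8', sep=',', index=False)

If I understand the docs correctly, this builds the index once and answers all 100k queries in a single kneighbors call instead of one pandas apply pass per row.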