
I want to optimize a piece of code that calculates the nearest neighbour for every item in a given dataset with 100k rows. The dataset contains 50 variable columns that describe each row item, and most cells contain a probability value between 0 and 1.

Question: I am fairly new to Python, but I would like to know whether any more advanced users can recommend a better structure for the code below that would speed up this calculation. Currently, the program takes a very long time to complete. Thanks in advance!

import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
from sklearn.neighbors import KNeighborsRegressor

df_set = pd.read_excel('input.xlsx', skiprows=0)

distance_columns = ["var_1",
                    ......,
                    ......,
                    ......,
                    "var_50"]

def euclidean_distance(row):
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_row[k]) ** 2
    return math.sqrt(inner_value)

knn_name_list = []

for i in range(len(df_set.index)):

    numeric = df_set[distance_columns]
    normalized = (numeric - numeric.mean()) / numeric.std()
    normalized.fillna(0, inplace=True)
    selected_normalized = normalized[df_set["Filename"] == df_set["Filename"][i]]

    euclidean_distances = normalized.apply(lambda row: distance.euclidean(row, selected_normalized), axis=1)

    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)

    second_smallest = distance_frame.iloc[1]["idx"]
    most_similar_to_selected = df_set.loc[int(second_smallest)]["Filename"]

    knn_name_list.append(most_similar_to_selected)

    print(knn_name_list)


df_set['top_neighbor'] = np.array(knn_name_list)

df_set.to_csv('output.csv', encoding='utf-8', sep=',', index=False)
d_-
  • Have you done any measurements yet to see where the bottlenecks are? This can be as simple as printing the output of `time.perf_counter()` with some labels to see where the performance tanks (see the sketch after these comments). This might turn out to be just the library functions, at which point measuring performance becomes a bit trickier, but these simple tests can lead to surprising results and simple fixes to improve performance. – Etienne Ott Sep 24 '19 at 14:20
  • Great suggestion, I will try to do this today and update my post if I get any new insights. Thanks @Etienne! – d_- Sep 24 '19 at 14:54
  • I guess you don't want to use packages (sklearn, scipy, etc.), right? – Ilker Kurtulus Sep 25 '19 at 14:08
  • No, please recommend answers with packages also. I will test the speed of both approaches. – d_- Sep 25 '19 at 15:17
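
Following up on the timing suggestion above, a minimal sketch of how the loop body from the question could be instrumented with time.perf_counter(); the labels and split points are illustrative, and numeric, selected_normalized and distance refer to the variables and import from the question code:

import time

t0 = time.perf_counter()
normalized = (numeric - numeric.mean()) / numeric.std()
print("normalization took", time.perf_counter() - t0, "seconds")

t0 = time.perf_counter()
euclidean_distances = normalized.apply(lambda row: distance.euclidean(row, selected_normalized), axis=1)
print("distance computation took", time.perf_counter() - t0, "seconds")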

2 Answers


I would recommend using NearestNeighbors (set n_jobs=-1 to use all processors):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

#Loading data
df_set = ...

#Selecting numerical data
numeric = df_set[distance_columns]

#Normalizing
normalized = normalize(numeric, norm='l1', axis=1, copy=True, return_norm=False)

#Initializing NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='euclidean', n_jobs=-1)

#Fitting with normalized data
neigh.fit(normalized)
...
second_smallest = ...

#Getting the item most similar to your selected data
most_similar_to_selected = neigh.kneighbors(second_smallest)
Amine Benatmane
  • Any ideas about how to integrate the for loop into your answer? It is important to loop over each row in the dataset and recompute the nearest neighbor for each item in the dataset. – d_- Sep 25 '19 at 15:19
  • You can do that with the kneighbors_graph method (see the sketch below): https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors.kneighbors_graph – Amine Benatmane Sep 25 '19 at 15:44
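
For illustration, a sketch of how the per-row loop could collapse into a single batch query over the whole fitted matrix; this uses kneighbors rather than kneighbors_graph, since it returns the neighbour indices directly. normalized, df_set and "Filename" come from the question/answer above, and n_neighbors=2 is an assumption so that column 0 of the result is each row itself and column 1 is its nearest other row:

from sklearn.neighbors import NearestNeighbors

#Fit once on the full normalized matrix instead of recomputing per row
neigh = NearestNeighbors(n_neighbors=2, metric='euclidean', n_jobs=-1)
neigh.fit(normalized)

#One call returns, for every row, its distance/index to itself (column 0)
#and to its closest other row (column 1)
distances, indices = neigh.kneighbors(normalized)
df_set['top_neighbor'] = df_set['Filename'].values[indices[:, 1]]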

To add another idea to @Amine's approach, you can also include a PCA transformation (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

This would work like this:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA 
#Loading data
df_set = ...

#Selecting numerical data
numeric = df_set[distance_columns]

#Normalizing
normalized = normalize(numeric, norm='l1', axis=1, copy=True, return_norm=False)
#Reduce the number of components (here to 25)
pca = PCA(n_components=25)
pcanormalized = pca.fit_transform(normalized)
#Initializing NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='euclidean', n_jobs=-1)

#Fitting with the PCA-reduced, normalized data
neigh.fit(pcanormalized)
...
second_smallest = ...

#Getting the item most similar to your selected data
most_similar_to_selected = neigh.kneighbors(second_smallest)
PV8
  • Thanks for the answer. I will also try to execute this code to see the speed. Any ideas about how to integrate the for loop (to take into consideration each item in the dataset as described in my question) into your answer? – d_- Sep 25 '19 at 15:20
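
Regarding the loop question in the comment above, a hedged sketch of the same batch approach applied to this PCA variant; pcanormalized, df_set and "Filename" come from the answer/question above, and n_neighbors=2 is again an assumption so that the second column skips the row itself:

#A single kneighbors call over the PCA-reduced matrix replaces the per-row loop
neigh = NearestNeighbors(n_neighbors=2, metric='euclidean', n_jobs=-1)
neigh.fit(pcanormalized)

distances, indices = neigh.kneighbors(pcanormalized)
df_set['top_neighbor'] = df_set['Filename'].values[indices[:, 1]]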