The following section of my code is taking ages to run (it's the only loop in the function, so it's the most likely culprit):
tree = KDTree(x_rest)
for i in range(len(x_lost)):
    _, idx = tree.query([x_lost[i]], k=int(np.sqrt(len(x_rest))), p=1)
    y_lost[i] = mode(y_rest[idx][0])[0][0]
Is there a way to speed this up? I have a few suggestions from Stack Overflow:
- This answer suggests using Cython. I'm not particularly familiar with it, but I'm not opposed to it either.
- This answer uses a multiprocessing Pool. I'm not sure how much this will help: my current run takes over 12 hours, and I'm hoping for at least a 5-10x speedup (though that may not be possible).
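
Another option that might be worth trying before Cython or a Pool is to query the tree in batches instead of one point per call. The following is only a sketch under assumptions: it assumes `KDTree` is `scipy.spatial.KDTree` (suggested by the `p=1` argument), that the installed SciPy is recent enough to support the `workers` argument of `query`, and that `y_rest` is a 1-D array of class labels. The function name `knn_mode_labels` and the `chunk_size` parameter are hypothetical, and the per-row mode is computed with NumPy rather than `scipy.stats.mode`.

```python
import numpy as np
from scipy.spatial import KDTree


def knn_mode_labels(x_rest, y_rest, x_lost, chunk_size=1024, workers=-1):
    """Hypothetical helper: most common label among the k nearest
    neighbours (Manhattan distance) for every row of x_lost."""
    k = int(np.sqrt(len(x_rest)))
    tree = KDTree(x_rest)

    # Encode labels as 0..n_classes-1 so they can be counted with bincount.
    classes, y_codes = np.unique(y_rest, return_inverse=True)
    y_codes = y_codes.ravel()

    out = np.empty(len(x_lost), dtype=classes.dtype)
    # Query in chunks so the (chunk, k) index array stays small in memory.
    for start in range(0, len(x_lost), chunk_size):
        chunk = x_lost[start:start + chunk_size]
        # One call for the whole chunk; workers=-1 asks SciPy to use all cores.
        _, idx = tree.query(chunk, k=k, p=1, workers=workers)
        idx = idx.reshape(len(chunk), -1)  # k == 1 gives a 1-D result
        neighbour_codes = y_codes[idx]
        # Count label occurrences per row, then take the most frequent one
        # (ties resolve to the smallest label, since classes is sorted).
        counts = np.apply_along_axis(
            np.bincount, 1, neighbour_codes, minlength=len(classes)
        )
        out[start:start + chunk_size] = classes[counts.argmax(axis=1)]
    return out
```

It would be called as `y_lost = knn_mode_labels(x_rest, y_rest, x_lost)`. Whether this gets anywhere near a 5-10x speedup depends on the data size and dimensionality, so it would need to be timed on the real arrays.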