The following section of my code is taking ages to run (it's the only loop in the function, so it's the most likely culprit):

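import numpy as np
from scipy.spatial import KDTree
from scipy.stats import mode

tree = KDTree(x_rest)
for i in range(len(x_lost)):
    # One query per point: the k nearest neighbours under the Manhattan (p=1) metric
    _, idx = tree.query([x_lost[i]], k=int(np.sqrt(len(x_rest))), p=1)
    y_lost[i] = mode(y_rest[idx][0])[0][0]  # most common label among the neighbours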

Is there a way to speed this up? I have a few suggestions from Stack Overflow:

Rahul
  • I can't believe I didn't think of that. So would calling `tree.query(x_lost, k=...)` give me an array of `idx`? Perhaps the last line could use a list comprehension then. – Rahul Feb 05 '23 at 00:15
  • That's great, thanks a lot! Please add it as an answer and I'll accept it. – Rahul Feb 05 '23 at 02:29

1 Answer

Here are a few notes about how you could speed this up:

  1. This code loops over x_lost and calls tree.query() with one point at a time. However, query() also accepts an array of points. The loop inside query() is implemented in Cython, so I would expect it to be much faster than a loop written in Python. If you pass all of x_lost in a single call, it returns an array of matches, with one row of neighbor indices per query point.

  2. The query() function supports a parameter called workers which, if set to a value larger than one, runs your query in parallel. Since workers is implemented using threads, it will likely be faster than a solution using multiprocessing.Pool, because it avoids pickling. See the documentation.

  3. The code above doesn't define the mode() function, but I'm assuming it's scipy.stats.mode(). If that's the case, rather than calling mode() repeatedly, you can use the axis argument, which would let you take the mode of nearby points for multiple queries at once.
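
Putting the three notes together, a minimal sketch might look like the following. The function name fill_labels and its optional workers argument are my own names for illustration, not part of the question's code. It also assumes the labels in y_rest are numeric (recent SciPy versions of mode() only accept numeric arrays), and it flattens the result because the shape mode() returns differs between SciPy versions.

```python
import numpy as np
from scipy.spatial import KDTree
from scipy.stats import mode


def fill_labels(x_rest, y_rest, x_lost, workers=None):
    """Label each row of x_lost with the most common label among its nearest x_rest rows."""
    x_rest = np.asarray(x_rest)
    y_rest = np.asarray(y_rest)
    x_lost = np.asarray(x_lost)

    tree = KDTree(x_rest)
    k = int(np.sqrt(len(x_rest)))

    # Only pass `workers` when asked for, since older SciPy releases do not accept it.
    extra = {} if workers is None else {"workers": workers}

    # One batched call: idx holds one row of neighbour indices per query point.
    _, idx = tree.query(x_lost, k=k, p=1, **extra)
    idx = idx.reshape(len(x_lost), -1)  # with k == 1 the result is 1-D, so restore the column

    # Row-wise mode over the (n_queries, k) matrix of neighbour labels.
    neighbour_labels = y_rest[idx]
    return np.ravel(mode(neighbour_labels, axis=1).mode)
```

With this, the original loop becomes y_lost = fill_labels(x_rest, y_rest, x_lost). On a SciPy version whose query() accepts workers, passing workers=-1 should use all CPU threads.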

Nick ODell
  • This is exactly what I did! I rewrote the function in Cython, moved the `query` to a batch call, and used `scipy.stats.mode` with `axis=1`. I couldn't set `workers` sadly, since the computer cluster I'm using is on an older version of scipy, but doing this got me a 50x speedup, which was enough. – Rahul Feb 06 '23 at 03:42