Significant delay between python function completion and return

Question

I've come across an issue I can't seem to diagnose, where a Python function doesn't return until a long time after it actually completes. A minimum example (edited to be simpler):

from sklearn.neighbors import KDTree
import time

def mwe(n):
    start = time.time()
    a = KDTree([[1]]).query_radius([[1]]*int(n), 1)
    print(f'Function completed in: {time.time()-start:.2f} seconds')

start = time.time()
mwe(1e6)
print(f'Function returned after {time.time()-start:.1f} seconds')

start = time.time()
mwe(1e7)
print(f'Function returned after {time.time()-start:.1f} seconds')

Output:

Function completed in: 0.36 seconds
Function returned after 4.4 seconds

Function completed in: 3.75 seconds
Function returned after 44.0 seconds

At first I thought this could be a garbage collection issue, but adding gc.disable() doesn't have any effect. As well, if it were garbage collection, I would expect it to appear prominently in the profile output—but cProfile doesn't even recognize the problem exists:

         645 function calls (636 primitive calls) in 2.701 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    1.069    0.534    1.076    0.538 {method 'query_radius' of 'sklearn.neighbors._ball_tree.BinaryTree' objects}
    32/23    0.919    0.029    1.059    0.046 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.570    0.570    2.701    2.701 test_return_delay.py:6(test_loop)
        4    0.135    0.034    0.135    0.034 {method 'repeat' of 'numpy.ndarray' objects}
        7    0.004    0.001    0.004    0.001 {method 'reduce' of 'numpy.ufunc' objects}
        5    0.002    0.000    0.002    0.000 {built-in method numpy.arange}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        2    0.000    0.000    0.246    0.123 index_tricks.py:322(__getitem__)
        4    0.000    0.000    0.005    0.001 validation.py:96(_assert_all_finite)
        3    0.000    0.000    0.000    0.000 function_base.py:23(linspace)
        4    0.000    0.000    0.005    0.001 validation.py:629(check_array)
        4    0.000    0.000    0.000    0.000 _array_api.py:168(_asarray_with_order)
        4    0.000    0.000    0.000    0.000 numerictypes.py:573(_can_coerce_all)
        4    0.000    0.000    0.000    0.000 validation.py:320(_num_samples)
       47    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        4    0.000    0.000    0.135    0.034 <__array_function__ internals>:177(repeat)
...

If I instead run cProfile as a module over the entire script (i.e., python -m cProfile -s time test_return_delay.py), it does at least show there's an issue, but it doesn't seem to give any indication where it's coming from as far as I can tell:

         788549 function calls (774939 primitive calls) in 12.372 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   11.257   11.257   11.257   11.257 {method 'enable' of '_lsprof.Profiler' objects}
      808    0.173    0.000    0.173    0.000 {built-in method io.open_code}
     3514    0.124    0.000    0.124    0.000 {built-in method nt.stat}
      157    0.094    0.001    0.097    0.001 {built-in method _imp.create_dynamic}
      808    0.076    0.000    0.076    0.000 {built-in method marshal.loads}
      808    0.045    0.000    0.045    0.000 {method 'read' of '_io.BufferedReader' objects}
     6273    0.033    0.000    0.051    0.000 <frozen importlib._bootstrap_external>:96(_path_join)
   1066/1    0.025    0.000   12.373   12.373 {built-in method builtins.exec}
      391    0.025    0.000    0.067    0.000 doccer.py:13(docformat)
1701/1692    0.020    0.000    0.086    0.000 {built-in method builtins.__build_class__}
  157/126    0.016    0.000    0.084    0.001 {built-in method _imp.exec_dynamic}
      399    0.011    0.000    0.019    0.000 doccer.py:180(indentcount_lines)
     1359    0.011    0.000    0.161    0.000 <frozen importlib._bootstrap_external>:1536(find_spec)
     9043    0.009    0.000    0.009    0.000 {method 'splitlines' of 'str' objects}
  499/100    0.009    0.000    0.025    0.000 sre_parse.py:494(_parse)
      809    0.009    0.000    0.009    0.000 {method '__exit__' of '_io._IOBase' objects}
    17871    0.009    0.000    0.009    0.000 {built-in method builtins.getattr}
      808    0.008    0.000    0.367    0.000 <frozen importlib._bootstrap_external>:950(get_code)
   100236    0.008    0.000    0.008    0.000 {method 'append' of 'list' objects}
 1013/842    0.007    0.000    0.337    0.000 <frozen importlib._bootstrap>:921(_find_spec)
   1040/2    0.007    0.000    1.116    0.558 <frozen importlib._bootstrap>:1022(_find_and_load)
     9075    0.007    0.000    0.007    0.000 {method 'expandtabs' of 'str' objects}
      671    0.007    0.000    0.020    0.000 inspect.py:2280(_signature_from_function)
        1    0.006    0.006    0.006    0.006 memory.py:371(MemorizedFunc)
...

Where might the issue lie, or how can I further debug?

I've simplified the example and raised the issue on scikit-learn's repository. The fix seems to be to have KDTree (and the other nearest neighbor classes) generate nested lists like scipy.spatial.KDTree does, rather than the current method of using arrays as the nested object.

It pretty much has to be garbage collection; as far as I know, the deletion of local variables is the only variable-time operation happening between the last line of a function and the next line of the caller. Note that `gc.disable()` only affects the *cyclic* garbage collector, not Python's normal reference counting mechanism. To test: time an explicit `del` of the local variables of the function, one at a time. — jasonharper, Jul 20 '23 at 13:44
You're right; if I add `del neighbors` at the end of the loop, the two times become the same. This might be better asked in a separate question or in scikit-learn's repo directly, but it leads to the question of how to free that data more efficiently - given it's taking 5x as long to free it as it does to create it. — brandonsmithj, Jul 20 '23 at 14:01
funny when instead of `del` you use `neighbours = 1+1` - this *heavy* computation takes 3+seconds — dankal444, Jul 20 '23 at 20:34
You might try a C-level profiler to see if you can get more details. I've had success with `perf` for tracking down similar problems (although I use it fairly infrequently so can't remember what arguments to use off the top of my head) — DavidW, Jul 21 '23 at 06:52

Significant delay between python function completion and return

0 Answers0