Most efficient way of conditional pairwise row operations in a pandas DataFrame

Asked Mar 18 '19 at 18:12

Active Mar 18 '19 at 18:12

Viewed 116 times

I have a pandas DataFrame `df` containing about 10⁶ rows. I want to run the following code:

import numpy

c = []
# dist(a, b) computes the distance between rows a and b (defined elsewhere)
for i, a in df.iterrows():
    for j, b in df.iterrows():
        if a.hit_id < b.hit_id and a.layer_id != b.layer_id:
            c.append(dist(a, b))
c = numpy.array(c)

What is the most efficient way of doing this?

shubham sangamnerkar

This is an O(N^2) operation. You cannot easily make it substantially more efficient. You can somewhat speed up the loop by sorting the DataFrame by "hit_id" and breaking the inner loop when `a.hit_id >= b.hit_id`. But that's about as far as you can get, I am afraid. – DYZ Mar 18 '19 at 18:20
Yes, I realised that, but I thought some kind of list comprehension might speed things up? – shubham sangamnerkar Mar 19 '19 at 02:37
List comprehension is just notation. It is not a silver bullet. – DYZ Mar 19 '19 at 03:08
Okay, thank you! Can we use something like dask to distribute the computation? If so, what is the right way to use it? – shubham sangamnerkar Mar 19 '19 at 17:41
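
The sort-then-skip idea from the first comment can be sketched with NumPy doing the inner loop instead of `iterrows`. This is only a sketch under assumptions that are not in the question: `dist` is taken to be the Euclidean distance over hypothetical coordinate columns `x`, `y`, `z`, and the function name `pairwise_distances` is made up for illustration. The work is still O(N²); only the constant factor should improve. The results come out in sorted `hit_id` order rather than the original row order.

```python
import numpy as np
import pandas as pd


def pairwise_distances(df, coord_cols=("x", "y", "z")):
    """Distances for all row pairs (a, b) with a.hit_id < b.hit_id
    and a.layer_id != b.layer_id.

    ASSUMPTION: dist(a, b) is Euclidean distance over coord_cols;
    the column names x, y, z are hypothetical.
    """
    # Sort once so that every candidate partner of row i lies after it.
    df = df.sort_values("hit_id", kind="stable")
    hit = df["hit_id"].to_numpy()
    layer = df["layer_id"].to_numpy()
    xyz = df[list(coord_cols)].to_numpy(dtype=float)

    chunks = []
    for i in range(len(df) - 1):
        # The strict inequality guards against duplicate hit_id values.
        mask = (hit[i + 1:] > hit[i]) & (layer[i + 1:] != layer[i])
        diff = xyz[i + 1:][mask] - xyz[i]
        chunks.append(np.sqrt((diff ** 2).sum(axis=1)))
    return np.concatenate(chunks) if chunks else np.empty(0)


example = pd.DataFrame({
    "hit_id": [3, 1, 2],
    "layer_id": [0, 0, 1],
    "x": [0.0, 3.0, 0.0],
    "y": [0.0, 4.0, 0.0],
    "z": [0.0, 0.0, 0.0],
})
result = pairwise_distances(example)
```

With about 10⁶ rows there are up to roughly 5 × 10¹¹ candidate pairs before the `layer_id` filter, so the full result array is unlikely to fit in memory. Reducing or writing out each chunk inside the loop, rather than concatenating everything, is probably necessary at that size.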
