Due to the size of your matrices and the requirements of your problem, I think iteration is unavoidable. You can't make use of broadcasting, since the intermediate array would explode your memory, so you need to operate on the existing array row by row. You can, however, use numba's njit to speed this up considerably over a pure-Python approach.
import numpy as np
from numba import njit

@njit
def zero_out_contained_rows(a):
    """
    Finds rows where all of the elements are
    equal to or smaller than all corresponding
    elements of another row, and sets all
    values in the row to zero.

    Parameters
    ----------
    a : ndarray
        The array to modify

    Returns
    -------
    The modified array

    Examples
    --------
    >>> zero_out_contained_rows(np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]]))
    array([[1, 0, 1],
           [0, 1, 0],
           [0, 0, 0]])
    """
    x, y = a.shape
    contained = np.zeros(x, dtype=np.bool_)
    for i in range(x):
        for j in range(x):
            if i != j and not contained[j]:
                equal = True
                for k in range(y):
                    if a[i, k] < a[j, k]:
                        equal = False
                        break
                contained[j] = equal
    a[contained] = 0
    return a
This keeps a running tally of which rows are contained in another row, and checks that tally before comparing. Together with the break, this short-circuits many unnecessary comparisons, before finally wiping out the rows that are contained in others with 0.
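For comparison, the same containment test can be written in plain NumPy by broadcasting one row at a time, which keeps the intermediate at O(n·m) memory. This is a sketch (the function name is mine), typically slower than the numba version but far faster than pure-Python loops:

```python
import numpy as np

def zero_out_contained_rows_rowwise(a):
    """Row-at-a-time NumPy variant: same result, O(n*m) peak memory."""
    n = a.shape[0]
    contained = np.zeros(n, dtype=bool)
    for j in range(n):
        # Rows whose every element is >= the corresponding element of row j
        ge = (a[j] <= a).all(axis=1)
        ge[j] = False               # a row does not contain itself
        contained[j] = ge.any()
    a[contained] = 0
    return a
```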
Compared to your initial attempt using iteration, this is a large speed improvement, and it also zeroes out the correct rows.
a = np.random.randint(0, 2, (6000, 6000))
%timeit zero_out_contained_rows(a)
1.19 s ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
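For reference, the fully broadcast version ruled out at the start would look something like the sketch below (the function name is mine). Its boolean intermediate has shape (n, n, m), which for a 6000 × 6000 input is 6000³ ≈ 2.16e11 elements, on the order of 200 GB, so it is only viable for small arrays:

```python
import numpy as np

def zero_out_contained_rows_broadcast(a):
    """Fully broadcast variant: correct, but O(n^2 * m) memory."""
    # le[i, j, k] is True when a[j, k] <= a[i, k]; shape (n, n, m)
    le = a[None, :, :] <= a[:, None, :]
    contained = le.all(axis=2)          # contained[i, j]: row j fits in row i
    np.fill_diagonal(contained, False)  # ignore self-comparisons
    a[contained.any(axis=0)] = 0
    return a
```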
I will update the timings once your attempt finishes running (currently at ~10 minutes).