
I would like to ask a question about a numpy array, below.

I have a dataset with 50 rows and 15 columns, and I created a numpy array from it like this:

x=x.to_numpy()

My aim is to compare each row with every other row (elementwise, excluding the row itself) and find whether there is any row whose values are all smaller than that row's.

Sample table:

a b c         
1 6 2
2 6 8
4 7 12
7 9 13

For example, for rows 1 and 2 there is no such row. But for rows 3 and 4 there are rows (rows 1 and 2) whose values are all smaller. So the algorithm should return the count 2 (indicating rows 3 and 4).

What Python code would produce this result?

I have tried a bunch of code, but could not reach a proper solution. So if anyone has an idea on that, I would appreciate it.

Yash Mehta
  • "Smaller" as in `<=` or as in `<`? (It makes a difference as to whether we need to explicitly exclude the current row or not.) – Ture Pålsson Dec 17 '22 at 09:32

1 Answer


Edit: Pure-numpy solution

(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
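As a quick sanity check on the 4×3 sample above (my own snippet, not part of the original answer; note the hard-coded 3 is the column count, so `x.reshape(-1, 1, x.shape[1])` would generalize it to the 15-column data):

```python
import numpy as np

# The 4x3 sample from the question.
x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

# Count rows that are strictly greater, elementwise, than at least one other row.
count = (x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
print(count)  # 2 (rows 3 and 4 each dominate at least one other row)
```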

Explanation

The hard part is thinking in 3d, so I will start in 2d, with a simple comparison of numbers. Imagine you have x=np.array([1,2,3,4]) and you want to compare all elements of x to all other elements of x, making a 4x4 matrix of booleans.

What you would do is reshape x as a column of values on one side, and as a line on the other. So, two 2d arrays: one 4x1, the other 1x4.

Then, when performing an operation between those two arrays, broadcasting will create a 4x4 array.

Just to visualize it, instead of a comparison, let's do this:

x=np.array([1,2,3,4])
x.reshape(-1,1) #is
#[[1],
# [2],
# [3],
# [4]]
x.reshape(1,-1) #is
# [ [1,2,3,4] ]
x.reshape(-1,1)*10+x.reshape(1,-1) #is therefore
# [[11, 12, 13, 14],
#  [21, 22, 23, 24],
#  [31, 32, 33, 34],
#  [41, 42, 43, 44]]

# Likewise 
x.reshape(-1,1)<x.reshape(1,-1) # is
#array([[False,  True,  True,  True],
#       [False, False,  True,  True],
#       [False, False, False,  True],
#       [False, False, False, False]])

So, all we have to do is the exact same thing, but with values being length-3 1d arrays instead of scalars:
x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)

Broadcasting will make this, as in the previous example, a 2d array of all x[i]>x[j], except that x[i], x[j], and therefore x[i]>x[j] are not scalars but 1d arrays of length 3. So our result is a 2d array of length-3 1d arrays, a.k.a. a 3d array.
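To see the shapes involved (my own illustration, using the sample data from the question):

```python
import numpy as np

x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

a = x.reshape(-1, 1, 3)  # shape (4, 1, 3): rows stacked as a "column"
b = x.reshape(1, -1, 3)  # shape (1, 4, 3): rows laid out as a "line"
c = a > b                # broadcast to shape (4, 4, 3)
print(a.shape, b.shape, c.shape)  # (4, 1, 3) (1, 4, 3) (4, 4, 3)

# c[i, j] is the elementwise comparison x[i] > x[j]:
print(c[3, 1])  # [ True  True  True]  since [7,9,13] > [2,6,8] elementwise
```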

Now we just have to apply our all, any, sum to this. For x[i] to be considered greater than x[j], we need every value of x[i] to be greater than the corresponding value of x[j]. Hence the all on axis 2 (the axis of length 3). Now we have a 2d matrix telling, for each i,j, whether x[i]>x[j].

For x[i] to have a smaller counterpart, that is, for x[i] to be greater than at least one x[j], we need at least one True in row i of that matrix. Hence the any(axis=1).

And lastly, what we have at this point is a 1d array of booleans, True where there exists at least one smaller row. We just need to count them. Hence the .sum().
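Step by step on the sample data, the three reductions look like this (my own sketch, not part of the original answer):

```python
import numpy as np

x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

cmp3d = x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)  # shape (4, 4, 3)

# dominates[i, j]: every value of x[i] is greater than the matching value of x[j]
dominates = cmp3d.all(axis=2)        # shape (4, 4)
print(dominates.astype(int))
# [[0 0 0 0]
#  [0 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]]

# has_smaller[i]: row i dominates at least one other row
has_smaller = dominates.any(axis=1)  # shape (4,)
print(has_smaller)        # [False False  True  True]
print(has_smaller.sum())  # 2
```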

Compound iteration

One-liner (with one loop; not ideal, but better than 2 loops):

sum((r>x).all(axis=1).any() for r in x)

r>x is an array of booleans comparing each element of row r to each element of x. So, for example, when r is row x[2], then r>x is

array([[ True,  True,  True],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False]])

So (r>x).all(axis=1) is a shape (4,) array of booleans telling whether all booleans in each line are True (because .all with axis=1 iterates through columns only). In the previous example, that would be [True, True, False, False]. (x[1]>x).all(axis=1) would be [False, False, False, False] (the first line of x[1]>x contains 2 Trues, but that is not enough for .all).

So (r>x).all(axis=1).any() tells you what you want to know: whether there is any line whose columns are all True, that is, whether there is any True in the previous array.

((r>x).all(axis=1).any() for r in x) is a generator expression performing this computation for all rows r of x. If you replaced the outer ( ) with [ ], you would get a list of True and False (False, False, True, True, to be accurate; as you've already said, False for the first two rows, True for the other two). But there is no need to build a list here, since we just want to count. A generator produces results only as the caller requires them, and here the caller is sum.

sum((r>x).all(axis=1).any() for r in x) counts the number of times we get True in the previous computation.

(In this case, since there are only 4 elements, I am not sparing much memory by using a generator expression rather than a list comprehension. But it is a good habit to favor generator expressions when we don't really need to keep a list of all intermediary results in memory.)
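Putting the generator version together on the sample data (my own snippet; the bool() just keeps the printed list readable across numpy versions):

```python
import numpy as np

# The 4x3 sample from the question.
x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

# One flag per row: does this row beat some other row in every column?
flags = [bool((r > x).all(axis=1).any()) for r in x]
print(flags)       # [False, False, True, True]
print(sum(flags))  # 2

# Same count without materializing the list:
print(sum((r > x).all(axis=1).any() for r in x))  # 2
```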

Timings

For your example, the computation takes 19 μs for pure numpy, 48 μs for my former answer, and 115 μs for di.bezrukov's.

But the difference (and the absence of difference) shows when the number of rows grows. For 10000×3 data, the computation takes 3.9 seconds for both of my answers, while di.bezrukov's method takes 353 seconds.

The reasons behind these two facts:

  • The difference grows with di.bezrukov's version because the number of inner for loops that I avoid grows, and they matter a lot.
  • The difference between my 2 versions disappears because my 2nd version (chronologically; first in this message, a.k.a. my pure numpy version) only spares the outer loop. When the number of rows is not that big, that is not negligible. But when it is big... well, that outer loop itself (not counting its content, which is optimized by the inner loops) is just O(n) in an O(n²) computation. So if n is big enough, we just don't care how efficient that outer loop is.
  • Even worse: memory-wise, the pure numpy version does what I was so proud of not doing in my first version: computing a full list of results. And that is nothing; it also computes a full 3d matrix of booleans, which is just an intermediary result. So for n big enough (say 100000, unless you have some 50 GB of RAM), that intermediary result doesn't fit into memory. And even if you do have 50 GB of RAM, it won't be faster.

Still, all 3 methods are O(n²). O(n²×m) even, if we call m the number of columns
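If memory is the concern with the pure numpy version, one common compromise (a sketch of my own, not from the original answer; the chunk size is arbitrary) is to build the 3d comparison block for only a slice of rows at a time, keeping the inner loops in numpy while bounding peak memory:

```python
import numpy as np

def count_dominating(x, chunk=1024):
    """Count rows of x that are elementwise > at least one other row,
    comparing `chunk` rows at a time to bound peak memory."""
    x = np.asarray(x)
    n = x.shape[0]
    out = np.zeros(n, dtype=bool)
    for start in range(0, n, chunk):
        block = x[start:start + chunk]             # shape (c, m)
        cmp3d = block[:, None, :] > x[None, :, :]  # shape (c, n, m), not (n, n, m)
        out[start:start + chunk] = cmp3d.all(axis=2).any(axis=1)
    return int(out.sum())

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(count_dominating(x))  # 2
```

This stays O(n²×m) in time, but the intermediary boolean array is only chunk×n×m instead of n×n×m.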

All have 3 nested loops. Di.bezrukov's has two explicit Python for loops, and one implicit loop in the .all (still a for loop, even if it is done in numpy's internal code). My generator version has 1 Python for loop, and 2 implicit loops, .all and .any.
My pure numpy version has no explicit loop, but 3 implicit nested numpy loops (in the building of the 3d array).

So, same time structure. Only numpy's loops are faster.

I am prouder of my pure numpy version, because I didn't find it at first. But pragmatically, my first (generator) version is better. It is slower only when it doesn't matter (for very small arrays). It doesn't consume any extra memory. And the only loop it leaves in Python is the outer one, which is negligible next to the inner loops.

tl;dr:

sum((r>x).all(axis=1).any() for r in x)

Unless you really have only 4 rows and μs matter, or you are engaged in a contest of who can think in purest numpy 3d-chess :D, in which case

(x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()
chrslg
  • I think you could do this in a single expression, but I'm too lazy (and not good enough at numpy index manipulations) to sort out the details: Consider the array as a 3D "slice", and turn this into a "loaf" by stacking several copies of it. Then take one slice and transpose it to a "lid" that you hold over the loaf. Now, you can perform the elementwise comparison by a single broadcast `<` between the lid and the loaf. Then you just need some `all`, `any` and `sum` to collect the result. – Ture Pålsson Dec 17 '22 at 10:26
  • I also think so. The reason why I started with "not ideal but", is because my first intent was to just post the one-liner, and then try to think a bit about a single expression (well, technically, my answer is a single expression. But I guess you mean, a single expression that does not imply a compound for). But then I got carried away by explanations and timings :D – chrslg Dec 17 '22 at 11:35
  • Plus, as I explain, it would numpize only the outer loop, which is negligible. – chrslg Dec 17 '22 at 12:18
  • @TurePålsson But see my edit. I think this is what you had in mind. It is more satisfying when you want to train yourself to use numpy as much as possible. But in this case, it is not that an improvement, and memory-wise, it may even be a disaster. – chrslg Dec 17 '22 at 12:19
  • @chrsig: Yes, that’s something like what I had in mind. And yes, it’s completely unreadable, allocates a 3D block of memory (it’s a pity numpy isn’t "lazy"...) and probably not very efficient. And longer than your first version. But fun to come up with! :) (The best I could do was even longer than yours...) – Ture Pålsson Dec 17 '22 at 12:29