Compare a matrix against a column vector

Question

Array 'A' and vector 'B' below are part of a pandas DataFrame.

I have a large array A of form:

I have a vector B of form:

How do I compare each column of A against B in a Pythonic way? I want True/False values for the comparison A < B, giving the following result:

TRUE    FALSE   FALSE
FALSE   FALSE   FALSE
TRUE    TRUE    TRUE
TRUE    FALSE   FALSE

I can do this with a list comprehension, but is there a better way? My arrays A and B are very large.

piRSquared · Accepted Answer · 2017-03-14T17:36:57.930

Consider the pd.DataFrame A and the pd.Series B:

import pandas as pd

A = pd.DataFrame([
        [28, 39, 52],
        [77, 80, 66],
        [7, 18, 24],
        [9, 97, 68]
    ])

B = pd.Series([32, 5, 42, 17])

`pandas`

By default, when you compare a pd.DataFrame with a pd.Series, pandas aligns each index value from the series with the column names of the dataframe. This is what happens when you use A < B. In this case, you have 4 rows in your dataframe and 4 elements in your series, so I'm going to assume you want to align the index values of the series with the index values of the dataframe. In order to specify the axis you want to align with, you need to use the comparison method rather than the operator. That's because when you use the method, you can use the axis parameter and specify that you want axis=0 rather than the default axis=1.

A.lt(B, axis=0)

       0      1      2
0   True  False  False
1  False  False  False
2   True   True   True
3   True  False  False

I often just write this as A.lt(B, 0)
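As a small sketch of what aligning on the index means in practice: the name B_reordered below is hypothetical and not part of the original answer. Because A.lt(..., axis=0) matches on index labels rather than on position, storing the series in reverse order should not change the result.

```python
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
B = pd.Series([32, 5, 42, 17])

# Hypothetical variant: same labels and values as B, stored in reverse order.
B_reordered = B.iloc[::-1]

expected = A.lt(B, axis=0)
# sort_index() only normalises the row order before comparing the two results.
by_label = A.lt(B_reordered, axis=0).sort_index()

# Matching is done on index labels, so the stored order of B does not matter.
print((by_label.to_numpy() == expected.to_numpy()).all())
```

A purely positional comparison on the underlying arrays would not have this property, which is why the numpy approach assumes the rows are already lined up.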

`numpy`

In numpy, you also have to pay attention to the dimensionality of the arrays and you are assuming that the positions are already lined up. The positions will be taken care of if they come from the same dataframe.

print(A.values)

[[28 39 52]
 [77 80 66]
 [ 7 18 24]
 [ 9 97 68]]

print(B.values)

[32  5 42 17]

Notice that B is a 1-dimensional array while A is a 2-dimensional array. To compare B along the rows of A, we need to reshape B into a 2-dimensional array with a single column. The most obvious way to do this is with reshape:

print(A.values < B.values.reshape(4, 1))

[[ True False False]
 [False False False]
 [ True  True  True]
 [ True False False]]

However, you will commonly see others do the same reshaping in these ways:

A.values < B.values.reshape(-1, 1)

Or

A.values < B.values[:, None]
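To see why the reshape is needed, here is a minimal sketch using plain numpy arrays (the lowercase names a and b are just stand-ins for A.values and B.values). It checks that the 1-dimensional comparison fails to broadcast and that the reshaping forms above produce the same column.

```python
import numpy as np

a = np.array([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
b = np.array([32, 5, 42, 17])

print(a.shape, b.shape)  # (4, 3) (4,)

# Broadcasting compares trailing dimensions first: 3 against 4 does not match.
try:
    a < b
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# As a column of shape (4, 1), b is stretched across the 3 columns of a.
col_reshape = b.reshape(-1, 1)
col_none = b[:, None]
same_column = np.array_equal(col_reshape, col_none)

result = a < col_none
print(broadcast_failed, same_column)
print(result)
```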

timed back test

To get a handle on how fast these comparisons are, I've constructed the following back test.

def pd_cmp(df, s):
    return df.lt(s, 0)

def np_cmp_a2a(df, s):
    """To get an apples to apples comparison
    I return the same thing in both functions"""
    return pd.DataFrame(
        df.values < s.values[:, None],
        df.index, df.columns
    )

def np_cmp_a2o(df, s):
    """To get an apples to oranges comparison
    I return a numpy array"""
    return df.values < s.values[:, None]


results = pd.DataFrame(
    index=pd.Index([10, 1000, 100000], name='group size'),
    columns=pd.Index(['pd_cmp', 'np_cmp_a2a', 'np_cmp_a2o'], name='method'),
)

from timeit import timeit

for i in results.index:
    df = pd.concat([A] * i, ignore_index=True)
    s = pd.concat([B] * i, ignore_index=True)
    for j in results.columns:
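        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
        results.at[i, j] = timeit(
            '{}(df, s)'.format(j),
            'from __main__ import {}, df, s'.format(j),
            number=100
        )

# The frame starts out empty (object dtype), so cast to float before plotting
results.astype(float).plot()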

I can conclude that the numpy-based solutions are faster, but not by much. They all scale the same way.

Since OP says the arrays are very large, a performance comparison between pandas and numpy would be very informative for future readers – kmario23, Mar 14 '17 at 16:42

B. M. · Answer 2 · 2017-03-14T16:46:56.143

3

The most efficient approach is to drop down to the numpy level (A and B are DataFrames here):

A.values<B.values
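A minimal sketch of why this works, assuming B is a single-column DataFrame (the column name 'b' is made up for illustration): B.values then already has shape (4, 1), so numpy can broadcast it against the (4, 3) array from A without any reshaping. If B were a Series instead, B.values would be 1-dimensional and would need reshaping first.

```python
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
# Assumption: B is a one-column DataFrame; the column name 'b' is hypothetical.
B = pd.DataFrame({'b': [32, 5, 42, 17]})

print(A.values.shape, B.values.shape)  # (4, 3) (4, 1)

# The (4, 1) column is broadcast across the 3 columns of A.
result = A.values < B.values
print(result)
```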

edited Mar 14 '17 at 16:46

answered Mar 14 '17 at 16:12


score 3 · Answer 3 · answered Mar 14 '17 at 16:14

You can do this using lt and calling squeeze on B so it flattens the df to a 1-D Series:

In [107]:
A.lt(B.squeeze(),axis=0)

Out[107]:
       0      1      2
0   True  False  False
1  False  False  False
2   True   True   True
3   True  False  False

The problem is that without squeeze, pandas will try to align on the column labels, which we don't want. We want to broadcast the comparison along the column axis.
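For a self-contained version of the above, here is a sketch that assumes B starts out as a single-column DataFrame (the names B_frame and B_series and the column name 'b' are hypothetical):

```python
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
# Assumption: B is a one-column DataFrame, as this answer implies.
B_frame = pd.DataFrame({'b': [32, 5, 42, 17]})

# squeeze() drops the length-1 column axis, leaving a 1-D Series.
B_series = B_frame.squeeze()
print(type(B_series).__name__, B_series.shape)  # Series (4,)

# axis=0 matches the Series index against the row labels of A.
result = A.lt(B_series, axis=0)
print(result)
```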

kmario23 · Answer 4 · 2018-01-18T02:07:11.053

2

Yet another option using numpy is numpy.newaxis:

In [99]: B = B[:, np.newaxis]

In [100]: B
Out[100]: 
array([[32],
       [ 5],
       [42],
       [17]])

In [101]: A < B
Out[101]: 
array([[ True, False, False],
       [False, False, False],
       [ True,  True,  True],
       [ True, False, False]], dtype=bool)

Essentially, we're converting the vector B into a 2D array so that numpy can broadcast when comparing two arrays of different shapes.
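As a hedged sketch of the same idea starting from pandas objects: the session above appears to work on numpy arrays, so the example below converts with to_numpy() first (the names a, b and col are illustrative). Indexing a Series directly with [:, np.newaxis] may not be supported, depending on the pandas version, so converting first is the safer route.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
B = pd.Series([32, 5, 42, 17])

# Convert to plain numpy arrays before adding the new axis.
a = A.to_numpy()
b = B.to_numpy()
col = b[:, np.newaxis]

print(np.newaxis is None)  # True
print(b.shape, col.shape)  # (4,) (4, 1)

result = a < col
print(result)
```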

edited Jan 18 '18 at 02:07

answered Mar 14 '17 at 17:04


@EdChum `np.newaxis` is synonymous with `None`. `np.newaxis is None` evaluates to `True` – piRSquared Mar 14 '17 at 17:16
@piRSquared ah.. I didn't know that, I've always use `None` in these situations – EdChum Mar 14 '17 at 17:17
@EdChum as do most people because it's 6 characters less to write :-) – piRSquared Mar 14 '17 at 17:18
@piRSquared yeah, I'm lzy 2 – EdChum Mar 14 '17 at 17:18
@EdChum I prefer this way since it's more intuitive (implies that we're increasing the array axis by 1, each time we use `newaxis`) and readable :) – kmario23 Mar 14 '17 at 18:20
