Pandas: how to test that top-n-dataframe really results from original dataframe

Question

I have a DataFrame, foo:

       A   B   C   D   E
    0  50  46  18  65  55
    1  48  56  98  71  96
    2  99  48  36  79  70
    3  15  24  25  67  34
    4  77  67  98  22  78

and another DataFrame, bar, which contains the 2 largest values of each row of foo. All other values have been replaced with zeros to create sparsity:

        A  B   C   D   E
    0   0  0   0  65  55
    1   0  0  98   0  96
    2  99  0   0  79   0
    3   0  0   0  67  34
    4   0  0  98   0  78

How can I test that every row in bar really contains the desired values?

One more thing: the solution should work with large DataFrames, e.g. 20000 x 20000.

score 0 · Answer 1 · answered Feb 22 '16 at 21:12

You could do that by looping over the rows and sorting each one, but a vectorized approach is probably better:

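    import numpy as np

    n = foo.shape[0]  # number of rows

    # Test 1:
    # bar equals foo in exactly two positions per row (the kept values)
    diff = foo - bar
    test1 = ((diff == 0).sum(axis=1) == 2).sum() == n

    # Test 2:
    # bar has exactly 3 zeros in each row
    test2 = ((bar == 0).sum(axis=1) == 3).sum() == n

    # Test 3:
    # the 2 values that bar keeps are the row maxima
    # replace the zeros with NaN so that min() ignores them
    bar2 = bar.replace(0, np.nan)
    # diff holds the dropped values (and 0 where a value was kept), so for
    # positive data its row max is the largest dropped value; that must not
    # exceed the smallest kept value (<= allows ties)
    row_ok = diff.max(axis=1) <= bar2.min(axis=1)
    test3 = row_ok.sum() == n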

I think this covers all cases, but haven't tested it all...
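
The three tests above can be folded into one function that works on the underlying NumPy arrays. This is only a sketch, not a tested production routine: the name is_top_n and the chunk_rows parameter are made up for illustration. It assumes that foo contains strictly positive numbers (so a zero in bar always means "dropped") and that foo and bar have their rows and columns in the same order. Rows are processed in blocks so that the temporary arrays stay small for something like a 20000 x 20000 frame.

```python
import numpy as np
import pandas as pd


def is_top_n(foo, bar, n=2, chunk_rows=1000):
    """Return True if every row of bar keeps exactly the n largest values
    of the same row of foo and is zero everywhere else.

    Assumes foo is strictly positive and that both frames share the same
    row and column order (labels are not compared).
    """
    if foo.shape != bar.shape:
        return False
    f_all = foo.to_numpy()
    b_all = bar.to_numpy()

    for start in range(0, f_all.shape[0], chunk_rows):
        f = f_all[start:start + chunk_rows]
        b = b_all[start:start + chunk_rows]
        kept = b != 0

        # Each row keeps exactly n entries ...
        if not (kept.sum(axis=1) == n).all():
            return False
        # ... and every kept entry is the original value from foo.
        if not (b[kept] == f[kept]).all():
            return False

        # The largest dropped value must not exceed the smallest kept value.
        largest_dropped = np.where(kept, -np.inf, f).max(axis=1)
        smallest_kept = np.where(kept, f, np.inf).min(axis=1)
        if not (largest_dropped <= smallest_kept).all():
            return False
    return True


# The frames from the question
cols = list("ABCDE")
foo = pd.DataFrame([[50, 46, 18, 65, 55],
                    [48, 56, 98, 71, 96],
                    [99, 48, 36, 79, 70],
                    [15, 24, 25, 67, 34],
                    [77, 67, 98, 22, 78]], columns=cols)
bar = pd.DataFrame([[0, 0, 0, 65, 55],
                    [0, 0, 98, 0, 96],
                    [99, 0, 0, 79, 0],
                    [0, 0, 0, 67, 34],
                    [0, 0, 98, 0, 78]], columns=cols)

print(is_top_n(foo, bar))  # True
```

Comparing the largest dropped value with the smallest kept value avoids sorting the rows at all, and using <= means a tie between a kept and a dropped value is still accepted as a valid top-n selection.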
