Matrix of pairwise row operations on pandas.DataFrame

Question

I want to create a matrix of the results of operations on all pairs of rows in a DataFrame.

Here's an example of what I want:

import pandas

df = pandas.DataFrame({'val':  [ 2,    3,    5,    7  ],
                       'foo':  ['f1', 'f2', 'f3', 'f4']},
                      index=   ['n1', 'n2', 'n3', 'n4'])

def op1(row1, row2):
    return row1['val']*row2['val']

def op2(row1, row2):
    return f"{row1['foo']}{row2['foo']}"

def apply_op_to_all_row_pairs(df, op):
    # what goes in here?
    ...

apply_op_to_all_row_pairs(df, op1)
#     n1  n2  n3  n4
# n1   4   6  10  14
# n2   6   9  15  21
# n3  10  15  25  35
# n4  14  21  35  49

apply_op_to_all_row_pairs(df, op2)
#         n1      n2      n3      n4
# n1  'f1f1'  'f1f2'  'f1f3'  'f1f4'
# n2  'f2f1'  'f2f2'  'f2f3'  'f2f4'
# n3  'f3f1'  'f3f2'  'f3f3'  'f3f4'
# n4  'f4f1'  'f4f2'  'f4f3'  'f4f4'

I have seen a lot of solutions that hinge on existing functions for computing distance matrices, but I want something more generic. For example, scipy.spatial.distance.pdist produces output in roughly the shape I want, but it only deals in floats and doesn't let you select columns by name (or at least I couldn't figure out how).

cs95 · Accepted Answer · 2018-12-28T05:05:40.880

You can just use broadcasted numpy operations:

v = df.val.values[:, None] * df.val.values
v

array([[ 4,  6, 10, 14],
       [ 6,  9, 15, 21],
       [10, 15, 25, 35],
       [14, 21, 35, 49]])

x = df.foo.values[:, None] + df.foo.values
x

array([['f1f1', 'f1f2', 'f1f3', 'f1f4'],
       ['f2f1', 'f2f2', 'f2f3', 'f2f4'],
       ['f3f1', 'f3f2', 'f3f3', 'f3f4'],
       ['f4f1', 'f4f2', 'f4f3', 'f4f4']], dtype=object)
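To see why indexing with `[:, None]` yields every pair, it may help to look at the shapes involved. This is only a small sketch; the names `vals`, `col` and `product` are illustrative and not part of the original answer.

```python
import numpy as np

vals = np.array([2, 3, 5, 7])

# Adding an axis turns the 1-D array into a column vector.
col = vals[:, None]
print(col.shape, vals.shape)  # shapes of the column vector and the original array

# Broadcasting a (4, 1) array against a (4,) array gives a (4, 4) array
# whose entry [i, j] is vals[i] * vals[j].
product = col * vals

# For multiplication this matches NumPy's outer-product form of the ufunc.
print(np.array_equal(product, np.multiply.outer(vals, vals)))
```

The same shape logic applies to the string concatenation: each element of the column vector is combined with each element of the row vector.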

Converting the result to a DataFrame is simple: call the constructor, passing the original index as both the row and column labels:

pandas.DataFrame(x, index=df.index, columns=df.index)

      n1    n2    n3    n4
n1  f1f1  f1f2  f1f3  f1f4
n2  f2f1  f2f2  f2f3  f2f4
n3  f3f1  f3f2  f3f3  f3f4
n4  f4f1  f4f2  f4f3  f4f4
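The two steps above (broadcast, then wrap in a DataFrame) can be bundled into one helper that takes the operation as an argument, as long as the operation is itself vectorized. The sketch below is an assumption about how one might do that, not something from the original answer; `pairwise_from_column` is a hypothetical name, and `operator.mul` / `operator.add` stand in for the `*` and `+` used earlier.

```python
import operator

import pandas as pd


def pairwise_from_column(df, column, op):
    """Apply a vectorized binary op to every pair of values in one column.

    Hypothetical helper: `op` must accept NumPy arrays and broadcast.
    """
    values = df[column].to_numpy()
    # Column vector against row vector: op sees every (i, j) combination.
    result = op(values[:, None], values)
    return pd.DataFrame(result, index=df.index, columns=df.index)


df = pd.DataFrame({'val': [2, 3, 5, 7],
                   'foo': ['f1', 'f2', 'f3', 'f4']},
                  index=['n1', 'n2', 'n3', 'n4'])

products = pairwise_from_column(df, 'val', operator.mul)
concatenated = pairwise_from_column(df, 'foo', operator.add)
print(products)
print(concatenated)
```

This keeps the operation separate from the broadcasting, but only for operations that work on whole arrays of a single column.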

edited Dec 28 '18 at 05:05

answered Nov 02 '17 at 11:41

This doesn't complete the function in the format of the question, but nonetheless it does exactly what I want. Thanks! – Cai Nov 02 '17 at 12:49
@Cai That's great to hear! Thanks for letting me know. – cs95 Nov 02 '17 at 12:49
To clarify what I was originally asking: a way of having separately defined row-pair operations each of which could be applied to a dataframe. The difference between this and what you have given is that the function definition (`foo + foo`) is mixed in with the broadcasting. The fully separate one would be nice to have, but for what I need now this is completely sufficient. – Cai Nov 02 '17 at 12:53
@Cai I understand what you asked now. Unfortunately, decoupling that is expensive, and would result in greatly reduced performance. Just an FYI. – cs95 Nov 02 '17 at 12:54
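
For reference, the fully decoupled version discussed in the comments could look something like the sketch below. It fills in the question's `apply_op_to_all_row_pairs` with a plain Python loop over row pairs, which is one possible implementation rather than anything from the answer. It calls `op` once per pair, so it makes n² Python-level calls for n rows and should be expected to be much slower than broadcasting on large frames.

```python
import pandas as pd


def op1(row1, row2):
    return row1['val'] * row2['val']


def op2(row1, row2):
    return f"{row1['foo']}{row2['foo']}"


def apply_op_to_all_row_pairs(df, op):
    # Materialize each row as a Series once, then call op on every pair.
    rows = [row for _, row in df.iterrows()]
    data = [[op(r1, r2) for r2 in rows] for r1 in rows]
    return pd.DataFrame(data, index=df.index, columns=df.index)


df = pd.DataFrame({'val': [2, 3, 5, 7],
                   'foo': ['f1', 'f2', 'f3', 'f4']},
                  index=['n1', 'n2', 'n3', 'n4'])

print(apply_op_to_all_row_pairs(df, op1))
print(apply_op_to_all_row_pairs(df, op2))
```

Because each `op` receives full rows, it can read any columns by name, which is the flexibility the question asked for, at the cost of losing vectorization.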