I want to create a matrix of the results of operations on all pairs of rows in a DataFrame.
Here's an example of what I want:
import pandas

df = pandas.DataFrame({'val': [2, 3, 5, 7],
                       'foo': ['f1', 'f2', 'f3', 'f4']},
                      index=['n1', 'n2', 'n3', 'n4'])

def op1(row1, row2):
    return row1['val'] * row2['val']

def op2(row1, row2):
    return f"{row1['foo']}{row2['foo']}"

def apply_op_to_all_row_pairs(df, op):
    # what goes in here?
    ...
apply_op_to_all_row_pairs(df, op1)
#     n1  n2  n3  n4
# n1   4   6  10  14
# n2   6   9  15  21
# n3  10  15  25  35
# n4  14  21  35  49

apply_op_to_all_row_pairs(df, op2)
#         n1      n2      n3      n4
# n1  'f1f1'  'f1f2'  'f1f3'  'f1f4'
# n2  'f2f1'  'f2f2'  'f2f3'  'f2f4'
# n3  'f3f1'  'f3f2'  'f3f3'  'f3f4'
# n4  'f4f1'  'f4f2'  'f4f3'  'f4f4'
I have seen many solutions that rely on existing functions for computing distance matrices, but I want something more generic.
For example, scipy.spatial.distance.pdist produces output close to the format I want, but it only deals in floats and doesn't let you select columns by name (or at least I couldn't figure out how).
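To pin down the behaviour being asked for, here is a minimal baseline sketch. It assumes plain Python loops over `iterrows()` are acceptable and that `op` takes two row Series. It is not vectorized and calls `op` once per ordered pair, so treat it as a reference for the expected output rather than an efficient or definitive solution.

```python
import pandas

df = pandas.DataFrame({'val': [2, 3, 5, 7],
                       'foo': ['f1', 'f2', 'f3', 'f4']},
                      index=['n1', 'n2', 'n3', 'n4'])

def op1(row1, row2):
    return row1['val'] * row2['val']

def op2(row1, row2):
    return f"{row1['foo']}{row2['foo']}"

def apply_op_to_all_row_pairs(df, op):
    # Baseline sketch (an assumption, not an optimized approach):
    # build each row Series once, then evaluate op on every ordered pair.
    rows = [row for _, row in df.iterrows()]
    data = [[op(r1, r2) for r2 in rows] for r1 in rows]
    # Label both axes with the original index so cells can be looked up by name.
    return pandas.DataFrame(data, index=df.index, columns=df.index)

products = apply_op_to_all_row_pairs(df, op1)
labels = apply_op_to_all_row_pairs(df, op2)
print(products)
print(labels)
```

Because `op` receives whole rows, it can use any column by name and return any type, which is the generality that the distance-matrix helpers lack. The cost is n² Python-level calls.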