
I am trying to build a matrix of results from a function that takes a crosstab of two dataframe columns. The function is applied to each pair of columns in turn, so the end result is a square matrix with one entry per pair. The positions of the columns I want to pass to pd.crosstab are stored in a list, cols_index. Here's my code:

cols_index # list of dataframe column indices. All fine. 

res_matrix = np.zeros([len(cols_index),len(cols_index)]) # square matrix of zeros, each dimension is the length of the number of columns

for i in cols_index:
    for j in cols_index:
        confusion_matrix = pd.crosstab(df.columns.get_values()[i], df.columns.get_values()[j]) # df.columns.get_values()[location]
        result = my_function(confusion_matrix) # a scalar
        res_matrix[i, j] = result
return res_matrix

However I get the following error: ValueError: If using all scalar values, you must pass an index

There's no problem with my_function because if I run my_function on two columns of the dataframe, there's no issue:

confusion_matrix = pd.crosstab(df['colA'], df['colB'])
result = my_function(confusion_matrix) # returns 0.29999 which is fine

I've tried various ways of fixing this, including looking at this post: How to fill a matrix in Python using iteration over rows and columns

but in this case I can't see how to use broadcasting over the Pandas columns.

Any ideas appreciated, thanks.

LucieCBurgess

1 Answer


There are a few issues in your code -

  1. i and j must be integer positions, because you use them to index res_matrix.
  2. pd.crosstab needs a pandas.Series for each argument. You are passing strings (the column labels), even when i and j have the correct values.
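To illustrate the second point, here is a minimal sketch of the difference between a column label and the column's data. The dataframe is made-up toy data, not the asker's, and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
df = pd.DataFrame({
    "colA": ["x", "y", "x", "y"],
    "colB": ["p", "p", "q", "q"],
})

label = df.columns[0]        # the column label: a plain string
series = df[df.columns[0]]   # the column's data: a pandas.Series
same = df.iloc[:, 0]         # positional access returns the same Series

print(type(label).__name__, type(series).__name__)

# crosstab needs the data (Series), not the labels (strings)
table = pd.crosstab(df[df.columns[0]], df[df.columns[1]])
print(table)
```

Passing `label` instead of `series` hands crosstab a scalar string, which is what the original loop does.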

Please see the changes in code below -

def fun():
    # cols_index: list of integer column positions in df
    n = len(cols_index)
    res_matrix = np.zeros([n, n])  # square matrix of zeros, one row/column per selected column
    for i in range(n):
        for j in range(i + 1, n):
            # cols_index[i] maps matrix position i to the actual column position in df
            confusion_matrix = pd.crosstab(df[df.columns[cols_index[i]]], df[df.columns[cols_index[j]]])
            result = my_function(confusion_matrix)  # a scalar
            res_matrix[i, j] = result
    return res_matrix

I have modified the code following the OP's comment that cols_index is a list of column positions. I am also assuming that my_function is symmetric (swapping the two columns gives the same result), so I fill only the upper triangle of the matrix. This saves computation time and avoids the i == j case.
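As a self-contained sketch of the same idea, the version below takes the dataframe and the position list as arguments and mirrors the upper triangle into the lower one. Everything here is an assumption for illustration: `pairwise_matrix` is a hypothetical helper name, the toy dataframe is made up, and `my_function` is a stand-in (the share of observations in the largest crosstab cell), since the real function is not shown in the question.

```python
import numpy as np
import pandas as pd


def my_function(table):
    # Hypothetical stand-in for the asker's function:
    # share of observations falling in the largest crosstab cell.
    counts = table.to_numpy()
    return counts.max() / counts.sum()


def pairwise_matrix(df, cols_index, func):
    """Apply func to the crosstab of every pair of selected columns."""
    n = len(cols_index)
    res = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, so i != j
            # i, j index the result matrix; cols_index[i], cols_index[j]
            # are the actual column positions in df
            table = pd.crosstab(df.iloc[:, cols_index[i]],
                                df.iloc[:, cols_index[j]])
            res[i, j] = func(table)
    # Mirror into the lower triangle; only valid if func is symmetric
    return res + res.T


# Made-up data; positions 0, 2 and 3 play the role of cols_index
df = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": ["u", "v", "u", "v"],
    "c": [0, 0, 1, 1],
    "d": ["u", "u", "v", "v"],
})
res = pairwise_matrix(df, [0, 2, 3], my_function)
print(res)
```

The diagonal is left at zero because a column is never paired with itself. If the function is not symmetric, the inner loop would need to run over all j != i instead and the mirroring step should be dropped.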

Aritesh
  • Thanks @Aritesh for your help. The problem with ```i in range(len(cols_index))``` is that this starts i from zero, whereas the cols_index list is a selection of columns from the dataframe, e.g. [10, 17, 23, 24, 26, 52, 56]. So I think I do need ```for i in cols_index``` as I need i to be [10, 17, 23, 24, 26, 52, 56], not [0, 1, 2, 3, 4, 5, 6] which will return the wrong columns of the dataframe when I call crosstab. To be clear, cols_index is a list of ints. – LucieCBurgess May 31 '18 at 11:46
  • My next problem is that ```pd.crosstab``` doesn't seem to like being called on the same column twice: ```confusion_matrix = pd.crosstab(df[df.columns[i]], df[df.columns[j]])``` throws an error if i == j – LucieCBurgess May 31 '18 at 11:49
  • @LucieCBurgess, in that case I would add a conditional statement, if (i != j). Also, if your function is commutative (i.e. the result does not depend on the order of the operands), run it only for j > i – Aritesh May 31 '18 at 12:01