Calculate Pearson's Coefficient for Multidimensional features

Question

I have a pandas DataFrame in which each row is one sample and each column is one feature. One of the columns is a string column containing text such as "This is a red apple". How can I convert it to a form for which Pearson's correlation matrix can be computed over the whole DataFrame? Similarly, I have another column that holds a list of identifiers.

Below is an example:

 id    text                  list_of_ids     score1  score2
 1     "This is An apple"    [1, 2, 3, 4]    4.6     1.0
 2     "This is An orange"   [1, 5, 6]       5.2     1.4

score 0 · Answer 1 · answered Nov 13 '20 at 19:58

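Use str.get_dummies to split the text column into words and one-hot encode them, then concatenate the result with the original dataframe (col1 here is the text column):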

pd.concat([df, df['col1'].str.get_dummies(sep=' ')], axis=1)

Output

                col1          col2  col3  col4  An  This  apple  is  orange
0   This is An apple  [1, 2, 3, 4]   4.6   1.0   1     1      1   1       0
1  This is An orange     [1, 5, 6]   5.2   1.4   1     1      0   1       1

You can then drop the columns you don't want with .drop, for example the original text column, since Pearson's correlation is only computed on numeric columns.
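
As a rough end-to-end sketch of the idea above (not necessarily what the answerer had in mind), the same indicator encoding can be applied to the list column as well, and the correlation matrix computed on the numeric result. The column names come from the question's example; the word_ and id_ prefixes and the trick of joining the list into a "|"-delimited string are assumptions made for illustration.

```python
import pandas as pd

# Toy data mirroring the example in the question.
df = pd.DataFrame({
    "id": [1, 2],
    "text": ["This is An apple", "This is An orange"],
    "list_of_ids": [[1, 2, 3, 4], [1, 5, 6]],
    "score1": [4.6, 5.2],
    "score2": [1.0, 1.4],
})

# One 0/1 indicator column per distinct word in the text column.
word_dummies = df["text"].str.get_dummies(sep=" ").add_prefix("word_")

# One 0/1 indicator column per distinct identifier in the list column.
# The lists are joined into "|"-delimited strings first (an illustrative
# choice) so that str.get_dummies can split them again.
id_dummies = (
    df["list_of_ids"]
    .apply(lambda ids: "|".join(str(i) for i in ids))
    .str.get_dummies(sep="|")
    .add_prefix("id_")
)

# Keep only numeric features: the scores plus the indicator columns.
numeric = pd.concat([df[["score1", "score2"]], word_dummies, id_dummies], axis=1)

# Pairwise Pearson correlation between all numeric columns.
corr = numeric.corr(method="pearson")
print(corr)
```

Note that indicator columns which are constant across all samples (here "This", "is", "An" and identifier 1) have zero variance, so their correlations come out as NaN. With only two rows, every other pair is trivially +1 or -1, so a real dataset is needed for meaningful values.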

Vivek Kalyanarangan

What if the number of words can grow very large? Is this still the right way to compute correlation? – newbie Nov 14 '20 at 05:47
