I have a pandas DataFrame where each row corresponds to one sample and each column represents one feature. One of my columns is a string column containing text like "This is a red apple". How can I convert it into a form for which Pearson's correlation matrix can be computed for this DataFrame? Similarly, I have another column that holds a list of identifiers.

Below is an example:

 id    text                   list_of_ids     score1  score2
 1     "This is An apple"     [1, 2, 3, 4]    4.6     1.0
 2     "This is An orange"    [1, 5, 6]       5.2     1.4
desertnaut
newbie

1 Answer

Use str.get_dummies to add one indicator column per word (the text column is named col1 here):

pd.concat([df, df['col1'].str.get_dummies(sep=' ')], axis=1)

Output

   col1               col2          col3  col4  An  This  apple  is  orange
0  This is An apple   [1, 2, 3, 4]  4.6   1.0   1   1     1      1   0
1  This is An orange  [1, 5, 6]     5.2   1.4   1   1     0      1   1
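
For reference, a self-contained sketch that should reproduce the output above. The column names col1 to col4 come from the output; the way the DataFrame is built here is an assumption based on the question's example.

```python
import pandas as pd

# Assumed reconstruction of the question's example data
df = pd.DataFrame({
    "col1": ["This is An apple", "This is An orange"],
    "col2": [[1, 2, 3, 4], [1, 5, 6]],
    "col3": [4.6, 5.2],
    "col4": [1.0, 1.4],
})

# One 0/1 indicator column per distinct space-separated word
word_dummies = df["col1"].str.get_dummies(sep=" ")

# Append the indicator columns to the original frame
result = pd.concat([df, word_dummies], axis=1)
print(result)
```

Note that the split is case-sensitive and purely on spaces, so "An" and "an" would end up as two separate columns.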

You can then drop the columns that you don't want using .drop.
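
A rough sketch of the remaining steps, using the column names from the question. Encoding the list column with explode plus pd.get_dummies is an assumption of mine and goes beyond what is shown above; the word_ and id_ prefixes are only illustrative.

```python
import pandas as pd

# Assumed reconstruction of the question's example data
df = pd.DataFrame({
    "text": ["This is An apple", "This is An orange"],
    "list_of_ids": [[1, 2, 3, 4], [1, 5, 6]],
    "score1": [4.6, 5.2],
    "score2": [1.0, 1.4],
})

# Indicator columns for the words, prefixed to avoid name clashes
words = df["text"].str.get_dummies(sep=" ").add_prefix("word_")

# explode() yields one row per (sample, id) pair; get_dummies encodes each id,
# and the groupby on the original index collapses back to one row per sample
ids = (
    pd.get_dummies(df["list_of_ids"].explode(), prefix="id", dtype=int)
    .groupby(level=0)
    .max()
)

# Drop the non-numeric originals, keep the scores plus the indicator columns
numeric = pd.concat([df.drop(columns=["text", "list_of_ids"]), words, ids], axis=1)

corr = numeric.corr(method="pearson")
print(corr)
```

Columns that are constant across all samples (for example word_is here) have zero variance, so their correlations come out as NaN. With only two samples, every defined correlation is either 1 or -1, so the result only becomes informative on a larger dataset.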

Vivek Kalyanarangan
  • What if the number of words can grow very large? Is this still the right way to compute correlation? – newbie Nov 14 '20 at 05:47