I have trained a logistic regression classifier to predict whether a review is positive or negative. Now I want to append the predicted probabilities returned by predict_proba to the pandas DataFrame containing the reviews. I tried something like:

test_data['prediction'] = sentiment_model.predict_proba(test_matrix)

That doesn't work, because predict_proba returns a 2D NumPy array with one column per class, and a 2D array cannot be assigned to a single DataFrame column. What is the most efficient way to do this? I created test_matrix with scikit-learn's CountVectorizer:

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train_data['review_clean'].values.astype('U'))
test_matrix = vectorizer.transform(test_data['review_clean'].values.astype('U'))

Sample data looks like:

| Review                                      | Prediction |
| ------------------------------------------- | ---------- |
| "Toy was great! Our six-year old loved it!" | 0.986      |
DBE7

2 Answers

Assign the predictions to a variable, then extract each column of that variable and assign it to its own column of the pandas DataFrame. If x is the 2D NumPy array of predictions,

x = sentiment_model.predict_proba(test_matrix)

then you can do,

test_data['prediction0'] = x[:,0]
test_data['prediction1'] = x[:,1]
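
As a rough end-to-end sketch of the same idea: the toy reviews, the `sentiment` column, and the `proba_<label>` column names below are made up for illustration and are not from the question. The columns of the `predict_proba` result follow the order of `sentiment_model.classes_`, so naming the new columns after the class labels avoids guessing which column is the positive class.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the real train/test reviews.
train_data = pd.DataFrame({
    'review_clean': ['great toy loved it', 'terrible broke fast',
                     'loved it great', 'awful terrible waste'],
    'sentiment': [1, -1, 1, -1],
})
test_data = pd.DataFrame({'review_clean': ['great loved it', 'terrible waste']})

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.transform(test_data['review_clean'])

sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])

# Shape is (n_samples, n_classes); each row sums to 1.
proba = sentiment_model.predict_proba(test_matrix)

# Column i of proba belongs to sentiment_model.classes_[i].
for i, label in enumerate(sentiment_model.classes_):
    test_data[f'proba_{label}'] = proba[:, i]

print(test_data)
```

If only the probability of the positive class is needed, a single column such as `proba[:, 1]` is enough for a binary model, provided the positive label is the second entry of `classes_`.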
Karthik Arumugham
# Append the rows of a 2D NumPy array to an existing DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10).reshape(5, 2), columns=['a', 'b'])
print('df:', df, sep='\n')

arr = np.arange(100, 104).reshape(2, 2)
print('array to append:', arr, sep='\n')

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
df = pd.concat([df, pd.DataFrame(arr, columns=df.columns)], ignore_index=True)
print('df:', df, sep='\n')

output

df:
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
array to append:
[[100 101]
 [102 103]]
df:
     a    b
0    0    1
1    2    3
2    4    5
3    6    7
4    8    9
5  100  101
6  102  103
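
The example above appends the array as new rows. For the situation in the question, where the array has one row per existing DataFrame row, the same approach can be turned sideways with `axis=1`. This is only a sketch: the `proba` values are invented stand-ins for a `predict_proba` result, and the column names are borrowed from the other answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])

# Stand-in for a predict_proba result: one row per row of df, two columns.
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5],
                  [0.3, 0.7],
                  [0.6, 0.4]])

# Reuse df's index so the rows line up even if the index is not 0..n-1.
proba_df = pd.DataFrame(
    proba, columns=['prediction0', 'prediction1'], index=df.index)

# axis=1 adds the new columns next to the existing ones.
df = pd.concat([df, proba_df], axis=1)
print(df)
```

Passing `index=df.index` matters when the DataFrame has been filtered or shuffled, because `pd.concat` aligns on index labels and not on row position.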
Markus Dutschke