
I am experimenting with a sentiment analysis case and trying to run a random forest classifier on the following data:

|Topic               |value|label|
|Apples are great    |-0.99|0    |
|Balloon is red      |-0.98|1    |
|cars are running    |-0.93|0    |
|dear diary          |0.8  |1    |
|elephant is huge    |0.91 |1    |
|facebook is great   |0.97 |0    |

After splitting it into train and test sets with the sklearn library, I preprocess the Topic column as follows so that the count vectorizer can work on it:

x = train.iloc[:,0:2]
#except for alphabets removing all punctuations
x.replace("[^a-zA-Z]"," ",regex=True, inplace=True)

#convert to lower case
x = x.apply(lambda a: a.astype(str).str.lower())

x.head(2)

After that, I apply CountVectorizer to the topics column, combine the result with the value column, and fit a random forest classifier.

## Import library to check accuracy
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

## implement BAG OF WORDS
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])

train_set = pd.concat([x['compound'], pd.DataFrame(traindataset)], axis=1)

# implement RandomForest Classifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,train['label'])

But I receive an error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-41-7a1f9b292921> in <module>()
      1 # implement RandomForest Classifier
      2 randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
----> 3 randomclassifier.fit(train_set,train['label'])

4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: setting an array element with a sequence.

My idea is:

The values come from applying vader-sentiment, and I want to feed them to my random forest classifier as well, to see the impact of the VADER scores on the output.

Is there perhaps a way to multiply the data in the value column with the generated sparse matrix traindataset?

Can anyone please tell me how to do that in this case?

desertnaut
K C
  • Please do not format Python code as Javascript snippets (edited). – desertnaut Dec 03 '20 at 14:35
  • Hi, thanks for your comment. Being new and learning tech & SO can you please share valuable tips on how to share python code snippets in the question? Actually the format for enter code option takes a lot of time as we have to adjust spaces else, just the first line is displayed correctly – K C Dec 03 '20 at 16:57

1 Answer


The issue is concatenating another column to a sparse matrix (the output of countvector.fit_transform). For simplicity's sake, let's say your training data is:

x = pd.DataFrame({'Topics':['Apples are great','Balloon is red','cars are running',
                           'dear diary','elephant is huge','facebook is great'],
                  'value':[-0.99,-0.98,-0.93,0.8,0.91,0.97,],
                  'label':[0,1,0,1,1,0]})

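Wrapping the sparse matrix in pd.DataFrame does not expand it into one column per feature. As the output below shows, each row ends up as a single cell holding a sparse row, which gives you something weird: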

countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])

train_set = pd.concat([x['value'], pd.DataFrame(traindataset)], axis=1)

train_set.head(2)

    value   0
0   -0.99   (0, 0)\t1\n (0, 1)\t1
1   -0.98   (0, 3)\t1\n (0, 10)\t1

It is possible to convert your sparse matrix to a dense numpy array, after which your pandas dataframe will work; however, if your dataset is huge, this is extremely costly. To keep it sparse, you can do:

from scipy import sparse

# only the `sparse` submodule is imported above, so call sparse.hstack (not scipy.sparse.hstack)
train_set = sparse.hstack([sparse.csr_matrix(x['value']).reshape(-1,1),traindataset])

randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,x['label'])
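
For a small dataset, the dense route mentioned above can be sketched as follows. This is only an illustration on the toy data: the names bigram_counts, dense_train_set and dense_clf, the bigram_N column labels, and random_state=0 are assumptions made for the example, not part of the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

x = pd.DataFrame({
    'Topics': ['Apples are great', 'Balloon is red', 'cars are running',
               'dear diary', 'elephant is huge', 'facebook is great'],
    'value': [-0.99, -0.98, -0.93, 0.8, 0.91, 0.97],
    'label': [0, 1, 0, 1, 1, 0],
})

countvector = CountVectorizer(ngram_range=(2, 2))
traindataset = countvector.fit_transform(x['Topics'])

# Densify the bigram counts; only sensible when the matrix fits in memory.
bigram_counts = pd.DataFrame(traindataset.toarray(), index=x.index)
# Hypothetical string labels, so that all column names share one type.
bigram_counts.columns = [f'bigram_{i}' for i in range(bigram_counts.shape[1])]

# One 'value' column followed by one column per bigram.
dense_train_set = pd.concat([x[['value']], bigram_counts], axis=1)

dense_clf = RandomForestClassifier(n_estimators=200, criterion='entropy',
                                   random_state=0)
dense_clf.fit(dense_train_set, x['label'])
```

With real text data the vocabulary can easily reach hundreds of thousands of bigrams, which is why the sparse hstack version above is usually the safer choice.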

Also check out the help page for scipy.sparse.

StupidWolf
  • Thanks for showing the way. I think that, since the code imports the sparse module from scipy, it should be: train_set = sparse.hstack([sparse.csr_matrix(x['value']).reshape(-1,1),traindataset]) – K C Dec 05 '20 at 09:22
  • However, using this technique, I get an error on the test set when trying to run randomclassifier.predict – K C Dec 05 '20 at 09:23
  • ValueError: Number of features of the model must match the input. Model n_features is 414550 and input n_features is 110739 – K C Dec 05 '20 at 09:25
  • This is another problem altogether. You are fitting the vectorizer separately on each dataset, so if some words are present in train but not in test (or the other way around), the number of columns will differ. – StupidWolf Dec 08 '20 at 00:59
  • You can see this answer I posted on how to do it: https://stackoverflow.com/questions/65074784/oversampling-after-splitting-the-dataset-text-classification – StupidWolf Dec 08 '20 at 01:02
  • Thanks a lot for sharing this link. I will try it now and hope it works. However, since I am relatively new to ML, I saw some mentions of SMOTE and data leakage, concepts I do not know. Would you mind sharing other links where I can read about those too? – K C Dec 08 '20 at 12:06
  • Hi @KC, in my experience it really depends on your data. You can check out the imbalanced-learn package: https://imbalanced-learn.org/stable/auto_examples/index.html – StupidWolf Dec 09 '20 at 02:18
  • Thanks for sharing! You are really great. I appreciate that you also explain your answers in detail, making them very comprehensible for newbies like me. I wish you good luck in your journey too, and I hope to learn a lot more from you – K C Dec 09 '20 at 09:25
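
The feature-count mismatch discussed in the comments can be avoided by fitting the vectorizer on the training data only and reusing it to transform the test data. The following is a minimal sketch under stated assumptions: the 4/2 split of the toy data, the helper build_features, and random_state=0 are hypothetical and stand in for the real train/test split.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical split of the toy data; a real one would come from train_test_split.
train = pd.DataFrame({
    'Topics': ['Apples are great', 'Balloon is red', 'cars are running', 'dear diary'],
    'value': [-0.99, -0.98, -0.93, 0.8],
    'label': [0, 1, 0, 1],
})
test = pd.DataFrame({
    'Topics': ['elephant is huge', 'facebook is great'],
    'value': [0.91, 0.97],
    'label': [1, 0],
})

def build_features(frame, vectorizer, fit=False):
    """Stack the 'value' column next to the bigram counts, keeping it sparse."""
    if fit:
        counts = vectorizer.fit_transform(frame['Topics'])  # learns the vocabulary
    else:
        counts = vectorizer.transform(frame['Topics'])      # reuses the vocabulary
    value_col = sparse.csr_matrix(frame[['value']].to_numpy())
    return sparse.hstack([value_col, counts]).tocsr()

countvector = CountVectorizer(ngram_range=(2, 2))
X_train = build_features(train, countvector, fit=True)
X_test = build_features(test, countvector)  # same columns as X_train

randomclassifier = RandomForestClassifier(n_estimators=200, criterion='entropy',
                                          random_state=0)
randomclassifier.fit(X_train, train['label'])
predictions = randomclassifier.predict(X_test)
```

Bigrams that occur only in the test set are simply ignored by transform, so both matrices end up with the same number of columns and predict no longer complains.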