I am very new to scikit-learn and have a use case that I am trying to solve with the scikit-learn Python library.

I have a CSV file like this:

Label, userId, message, user_like, user_dislike
1, 1, "this is good message", 4, 5
0, 1, "This is bad message", 3, 4
1, 2, "this is good message", 4, 5
0, 1, "This is bad again", 6, 7

How can I train a MultinomialNB classifier on the data above? My challenge is that it contains both text data (the messages) and numeric data.

I want to predict whether the message "this is new message", posted by userId 1, is spam or not (0 or 1).

So the row to classify would be: ?, 1, "this is new message", 3, 4

Thanks

voila
  • Train it to predict what based on what? – BrenBarn May 26 '15 at 05:12
  • OK, so I want to predict whether the message "this is new message", posted by userId 1, is spam or not (0 or 1). The input data would be ?, 1, "this is new message", 7, 8. – voila May 26 '15 at 05:18
  • I wish someone would tell me why they are downvoting the question. – voila May 26 '15 at 05:31
  • @voila The question is not very clear. Look at the title and how little it says. You should instead have asked "How to combine text and numerical features in sklearn". I have added my answer; hope it helps. – Aditya May 27 '15 at 05:33

1 Answer

A simple yet effective idea is to train separate classifiers for the text data and the numeric data. Make sure you normalize the features as you go.
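
As a rough sketch of that idea, the snippet below fits one model on the messages and another on the like/dislike counts, using the four rows from the question. The choice of `CountVectorizer` for the text, and of `StandardScaler` plus `LogisticRegression` for the numeric columns, is my own assumption for illustration; any classifier that suits the numeric data could take its place.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The four rows from the question.
labels = np.array([1, 0, 1, 0])
messages = [
    "this is good message",
    "This is bad message",
    "this is good message",
    "This is bad again",
]
# Columns: user_like, user_dislike.
numeric = np.array([[4, 5], [3, 4], [4, 5], [6, 7]], dtype=float)

# Text model: bag-of-words counts fed into MultinomialNB.
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(messages, labels)

# Numeric model (assumed choice): standardize, then logistic regression.
num_clf = make_pipeline(StandardScaler(), LogisticRegression())
num_clf.fit(numeric, labels)

# The unseen row from the question: "this is new message", 3, 4.
new_message = ["this is new message"]
new_numeric = np.array([[3.0, 4.0]])

text_proba = text_clf.predict_proba(new_message)  # columns follow classes_ = [0, 1]
num_proba = num_clf.predict_proba(new_numeric)
print(text_proba, num_proba)
```

With only four rows the probabilities mean very little; the point is just that each model sees only the kind of feature it handles well.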

Once you have, say, two different classifiers, you can combine their results to predict whether a message is spam or not. Check http://scikit-learn.org/stable/modules/ensemble.html
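
One simple way to combine them is soft voting: average the class probabilities of the two models and pick the most likely class. The sketch below does this by hand, because the two models take different inputs; scikit-learn's `VotingClassifier` offers `voting="soft"` for estimators that share the same input. The model choices and the equal weighting are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The four rows from the question.
labels = np.array([1, 0, 1, 0])
messages = [
    "this is good message",
    "This is bad message",
    "this is good message",
    "This is bad again",
]
numeric = np.array([[4, 5], [3, 4], [4, 5], [6, 7]], dtype=float)

# One model per feature type (assumed choices).
text_clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(messages, labels)
num_clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(numeric, labels)

# The unseen row from the question.
new_message = ["this is new message"]
new_numeric = np.array([[3.0, 4.0]])

# Soft voting by hand: average the two probability estimates with equal weight.
avg_proba = (text_clf.predict_proba(new_message)
             + num_clf.predict_proba(new_numeric)) / 2.0

# Both models were fit on the same labels, so their classes_ line up.
prediction = text_clf.classes_[np.argmax(avg_proba, axis=1)]
print(avg_proba, prediction)
```

If one model is clearly more reliable than the other, a weighted average is a natural variation.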

To improve on this further, you can take the internal probability scores of each classifier and use them as features to train another classifier that makes the final prediction. This is called stacking.
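
A minimal stacking sketch, again on the four rows from the question: out-of-fold probabilities from each base model (obtained with `cross_val_predict`) become the input features of a second-level classifier. The base models, the `LogisticRegression` meta-classifier, and `cv=2` are all assumptions made so the example runs on such a tiny data set.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The four rows from the question.
labels = np.array([1, 0, 1, 0])
messages = [
    "this is good message",
    "This is bad message",
    "this is good message",
    "This is bad again",
]
numeric = np.array([[4, 5], [3, 4], [4, 5], [6, 7]], dtype=float)

# Base models (assumed choices), one per feature type.
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
num_clf = make_pipeline(StandardScaler(), LogisticRegression())

# Out-of-fold probabilities: each row is scored by a model that never saw it,
# so the meta-classifier is not trained on leaked predictions.
text_oof = cross_val_predict(text_clf, messages, labels, cv=2, method="predict_proba")
num_oof = cross_val_predict(num_clf, numeric, labels, cv=2, method="predict_proba")

# Meta-features: the probability of class 1 from each base model.
meta_X = np.hstack([text_oof[:, 1:], num_oof[:, 1:]])
meta_clf = LogisticRegression().fit(meta_X, labels)

# Refit the base models on all rows before scoring new data.
text_clf.fit(messages, labels)
num_clf.fit(numeric, labels)

# The unseen row from the question.
new_meta = np.hstack([
    text_clf.predict_proba(["this is new message"])[:, 1:],
    num_clf.predict_proba(np.array([[3.0, 4.0]]))[:, 1:],
])
final_prediction = meta_clf.predict(new_meta)
print(final_prediction)
```

On real data you would use more folds and far more rows; with four examples this only demonstrates the mechanics.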

Aditya