
I'm trying to train a classifier on text from a chat between two users, so that later I can predict which of the two is more likely to say a given sentence or word. To get there I mined the text from the chat log and ended up with two arrays of words, UserA_words and UserB_words.

Into which format do I have to transform these arrays to pass them to a classifier like Naive Bayes or an SVM? How do I pass, e.g., a bag-of-words representation to a classifier?

smci
xgrimau
  • Asking what ML representation to use for a specific classification task is on-topic at sister site [DataScience.SE](http://datascience.stackexchange.com). Please migrate there. – smci Oct 23 '16 at 22:23
  • Putting this on hold is not constructive: either migrate to DataScience.SE or leave open here. My answer shows that this has an actual answer. – smci Oct 26 '16 at 09:15
  • @smci Sorry, I'm new here. I posted a similar question on Data Science SE, but how do I migrate this one? Thank you in advance. – xgrimau Oct 26 '16 at 10:53
  • whiteTea you can't do anything - it's the users with [>3k reputation](http://stackoverflow.com/help/privileges) who voted to close instead of migrate, and are not voting to reopen or migrate. – smci Oct 26 '16 at 19:25
  • Please don't crosspost, but since you already posted [this](http://datascience.stackexchange.com/questions/14730/chat-text-classification-aproach/14768) at DataScience.SE let's take things over there. – smci Oct 26 '16 at 19:29

1 Answer


You're asking what ML representation you should use for user-classification of chat text.

Bag-of-words and word vectors are the main representations generally used in text processing. However, user classification of chat is not the usual text-processing task: here we look for telltale features indicative of a specific user. Here are some:

  • character length, word length, sentence length of each comment
  • typing speed (esp. if you have timestamps in seconds)
  • ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
  • ratio of capitalization
  • ratio of numerals
  • ratio of whitespace
  • character n-grams (note that these can pick up e.g. l0ser, f##k, :-) )
  • use of Unicode (emojis, symbols e.g. stars)
  • ratio of specific punctuation (e.g. how many '.', '!', '?', '*', '#' )
  • word-counts, esp. anything statistically anomalous
  • anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
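
As a rough illustration of how several of the features above could be computed, here is a minimal sketch using only the Python standard library. The function names (`ratio`, `char_ngrams`, `message_features`), the feature names, the choice of trigrams, and the sample message are all assumptions made for this example, not something taken from the question's data. Typing speed and misspelling counts are left out because they need timestamps and a dictionary, respectively.

```python
import string
from collections import Counter


def ratio(count, total):
    """Return count / total, or 0.0 when total is zero (empty message)."""
    return count / total if total else 0.0


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def message_features(text):
    """Map one chat message to a flat dict of numeric style features."""
    n_chars = len(text)
    words = text.split()
    features = {
        "char_length": n_chars,
        "word_count": len(words),
        "mean_word_length": ratio(sum(len(w) for w in words), len(words)),
        "punctuation_ratio": ratio(sum(c in string.punctuation for c in text), n_chars),
        "uppercase_ratio": ratio(sum(c.isupper() for c in text), n_chars),
        "digit_ratio": ratio(sum(c.isdigit() for c in text), n_chars),
        "whitespace_ratio": ratio(sum(c.isspace() for c in text), n_chars),
        # Crude proxy for emoji/symbol use: share of non-ASCII characters.
        "non_ascii_ratio": ratio(sum(ord(c) > 127 for c in text), n_chars),
    }
    # Counts of specific punctuation marks, one feature per symbol.
    for symbol in ".!?*#":
        features["count_" + symbol] = text.count(symbol)
    # Character trigram counts on the lowercased text; the "tri_" prefix
    # keeps them from clashing with the feature names above.
    for gram, count in char_ngrams(text.lower()).items():
        features["tri_" + gram] = count
    return features


# Hypothetical sample message, only for demonstration.
sample = "OMG!!! u r such a l0ser :-)"
feats = message_features(sample)
print(feats["punctuation_ratio"], feats["count_!"], feats["tri_l0s"])
```

One dict per message, paired with a label saying which user wrote it, is a format that scikit-learn's `DictVectorizer` can turn into a numeric matrix for a classifier such as `MultinomialNB` or `LinearSVC`. That pairing is a suggestion, and which features actually separate the two users would have to be checked on the real chat log.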
smci