What representation of chat text data should I use for user classification?

Question

I'm trying to train a classifier on text from a chat between two users, so that later I can predict which of the two is more likely to say a given sentence or word. To get there, I mined the text from the chat log and ended up with two arrays of words, UserA_words and UserB_words.

Into which format do I have to transform these arrays to pass them to a classifier like naive Bayes or an SVM? How do I pass, e.g., a bag-of-words representation to a classifier?

Asking what ML representation to use for a specific classification task is on-topic at sister site [DataScience.SE](http://datascience.stackexchange.com). Please migrate there. — smci, Oct 23 '16 at 22:23
Putting this on hold is not constructive: either migrate to DataScience.SE or leave open here. My answer shows that this has an actual answer. — smci, Oct 26 '16 at 09:15
@smci Sorry, I'm new here. I posted a similar question on Data Science SE, but how do I migrate this one? Thank you in advance. — xgrimau, Oct 26 '16 at 10:53
whiteTea you can't do anything - it's the users with [>3k reputation](http://stackoverflow.com/help/privileges) who voted to close instead of migrate, and are not voting to reopen or migrate. — smci, Oct 26 '16 at 19:25
Please don't crosspost, but since you already posted [this](http://datascience.stackexchange.com/questions/14730/chat-text-classification-aproach/14768) at DataScience.SE let's take things over there. — smci, Oct 26 '16 at 19:29

smci · Accepted Answer · 2016-10-26T09:47:17.980

You're asking what ML representation you should use for user-classification of chat text.

Bag-of-words and word vectors are the main representations generally used in text processing. However, user classification of chat is not the usual text-processing task: here we look for telltale features indicative of a specific user. Here are some:

character length, word length, sentence length of each comment
typing speed (esp. if you have timestamps in seconds)
ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
ratio of capitalization
ratio of numerals
ratio of whitespace
character n-grams (and notice these can pick up e.g. l0ser, f##k, :-) )
use of Unicode (emojis, symbols e.g. stars)
ratio of specific punctuation (e.g. how many '.', '!', '?', '*', '#' )
word-counts, esp. anything statistically anomalous
anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
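
As a rough illustration of how several of the features above could be turned into numbers, here is a minimal sketch using only the Python standard library. The function name `message_features`, the feature names, and the choice of character trigrams are all assumptions made for this example, not part of any library. Typing speed is left out because it needs timestamps, which the question does not say are available.

```python
import string
from collections import Counter


def message_features(text, ngram_size=3):
    """Map one chat message to a dict of numeric features (hypothetical helper)."""
    n_chars = len(text)
    denom = max(n_chars, 1)  # avoid division by zero on empty messages
    words = text.split()

    features = {
        "char_len": n_chars,
        "word_count": len(words),
        "mean_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "punct_ratio": sum(c in string.punctuation for c in text) / denom,
        "upper_ratio": sum(c.isupper() for c in text) / denom,
        "digit_ratio": sum(c.isdigit() for c in text) / denom,
        "space_ratio": sum(c.isspace() for c in text) / denom,
        "non_ascii_ratio": sum(ord(c) > 127 for c in text) / denom,
    }

    # Ratios of specific punctuation marks.
    for mark in ".!?*#":
        features["ratio_" + mark] = text.count(mark) / denom

    # Character n-gram counts; these can pick up spellings such as "l0ser" or ":-)".
    lowered = text.lower()
    ngrams = Counter(
        lowered[i:i + ngram_size] for i in range(len(lowered) - ngram_size + 1)
    )
    for gram, count in ngrams.items():
        features["ng_" + gram] = count

    return features


feats = message_features("OMG!!! u r such a l0ser :-)")
```

Each message becomes one such dict, labelled with the user who wrote it. A list of these dicts can then be converted into a numeric matrix (for instance with scikit-learn's `DictVectorizer`, if that library is what you use) and passed to a naive Bayes or SVM classifier together with the labels.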
