I’m developing a Naive Bayes classifier using the following dataset (https://www.kaggle.com/crowdflower/twitter-user-gender-classification/data).

What I’m trying to do is train a classifier that predicts a user’s gender based on the tweet text, the Twitter profile description, and the profile sidebar color. Since the tweet text and profile description are string columns, I need to preprocess the data before training the classifier. I have seen that many examples use the Strings to Document node for this, and then preprocess the resulting Document column with other nodes such as Number Filter, Case Converter, and so on.

Since I want to use more than one attribute to train my classifier, what should I do? Should I convert both string attributes (tweet text and profile description) into documents?

Giordano
  • It is up to you to decide what to do with your data. If you do not want to use two Strings to Document nodes, you can simply concatenate the two string columns beforehand (though that might not be what you want, as they are different texts). I do not see any problem with having two Strings to Document nodes in the workflow. – Gábor Bakos May 01 '18 at 12:07

1 Answer

I suggest creating a metanode containing all the preprocessing you want, then copying that metanode to preprocess each string column you consider useful for your model. Afterwards, use the data extractor node to get the preprocessed strings back out and join those columns into a new table with the Column Appender node.
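For readers who think more easily in code than in nodes, the same idea (one shared preprocessing step applied to each text column separately, with the results then joined) might be sketched in plain Python as below. This is only an illustration outside KNIME, not part of the workflow: the column names `text` and `description`, the helper names `preprocess` and `combine_columns`, and the two sample rows are all made up for the example.

```python
import re


def preprocess(text):
    """Shared cleaning step: lower-case, drop digits/punctuation, split into tokens.

    Note: this simple pattern also drops non-ASCII letters.
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def combine_columns(row, text_columns):
    """Preprocess each text column on its own and tag tokens with the column name,
    so that the same word in a tweet and in a profile description stays distinct."""
    tokens = []
    for column in text_columns:
        raw = row.get(column) or ""  # treat missing values as empty text
        tokens.extend(f"{column}={token}" for token in preprocess(raw))
    return tokens


# Hypothetical sample rows, invented for illustration only.
rows = [
    {"text": "Loving the NEW phone 2day!", "description": "Dad of 3, coffee fan"},
    {"text": "Coffee time", "description": None},
]

features = [combine_columns(row, ["text", "description"]) for row in rows]
```

Tagging each token with its source column mirrors the point made in the comment above: the two texts are different in nature, so keeping them apart lets the classifier weigh a word in the description differently from the same word in a tweet.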

Jason Angel