I hope you are all healthy and well. I am new to NLP, so I apologize in advance if this question is basic. I would like to build a predictive text-mining model on labeled text data. I have four text columns that can be used as predictors, and the labeled column is my class variable. The following gives a glimpse of the data set:

 var1   var2   var3   var4   class_var
 NA     text   text   NA     0
 text   text   NA     text   1
 text   NA     NA     text   1
 NA     NA     NA     text   0
 NA     text   text   text   1

As shown, some columns contain no text (marked NA) while others do. My question is whether I should combine all the text columns into one. If so, what would be an appropriate way to deal with the missing entries?

I truly appreciate your help.

Many thanks!

Phil
Alex

1 Answer


There are many options here, but since your data is already split into four columns, you could start by replacing each text with 1 if text is present and 0 if it is NA, then see how well a simple logistic regression predicts class_var from those four indicators. From there, you could move on to tokenizers and other text features.
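A minimal sketch of that first step, in plain Python so it runs without extra packages. The rows copy only the presence/absence pattern of the sample table in the question, and the helper names (`presence_flags`, `fit_logistic`, `predict`) are made up for illustration. The logistic regression here is a hand-rolled gradient-descent fit; in practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice.

```python
import math

# Rows mirror the presence/absence pattern of the question's sample table;
# None stands in for NA and "text" for any non-empty string.
rows = [
    (None, "text", "text", None, 0),
    ("text", "text", None, "text", 1),
    ("text", None, None, "text", 1),
    (None, None, None, "text", 0),
    (None, "text", "text", "text", 1),
]

def presence_flags(values):
    """Return 1 for each column that holds non-blank text, else 0."""
    return [1 if isinstance(v, str) and v.strip() else 0 for v in values]

X = [presence_flags(r[:4]) for r in rows]
y = [r[4] for r in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and a bias by full-batch gradient descent on log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, xi):
    """Predict class 1 when the estimated probability is at least 0.5."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if sigmoid(z) >= 0.5 else 0

w, b = fit_logistic(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

With only five rows this just checks that the indicators can separate the training labels; on real data you would hold out a test set before reading anything into the accuracy.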

kana
  • So should I not combine all the text into one column and start from there? – Alex Apr 30 '21 at 04:58
  • That would probably be a good next step after checking the above. The fact that your data is separated in the first place suggests to me that the information is segmented in some way, which is why I recommended the 1-or-0 check first. After that, I'd combine the columns, tokenize them, and try to classify. I'd also tokenize each column separately and classify on each to see whether one column is especially important. – kana Apr 30 '21 at 17:03
  • Thanks, I'll act accordingly and will be in touch for more help – Alex Apr 30 '21 at 18:24
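
The combine-then-tokenize and per-column steps suggested in the comments could look roughly like this. It is only a sketch: the example row is invented (the real data is not shown in the question), missing values are assumed to arrive as `None`, and the regex tokenizer is a simple stand-in for whatever tokenizer you end up using.

```python
import re
from collections import Counter

COLUMNS = ["var1", "var2", "var3", "var4"]

def has_text(value):
    """True when the value is a non-blank string (so None/NA is skipped)."""
    return isinstance(value, str) and bool(value.strip())

def combine_text(row, columns=COLUMNS):
    """Join the non-missing text columns of one row into a single string."""
    return " ".join(row[c].strip() for c in columns if has_text(row.get(c)))

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def column_tokens(row, columns=COLUMNS):
    """Tokenize each column on its own; missing columns give an empty list."""
    return {c: tokenize(row[c]) if has_text(row.get(c)) else [] for c in columns}

# Hypothetical example row, not taken from the asker's data.
row = {"var1": None, "var2": "Late delivery", "var3": "Refund requested", "var4": None}

combined = combine_text(row)                # one string for the whole row
bag_of_words = Counter(tokenize(combined))  # token counts for the combined text
per_column = column_tokens(row)             # separate token lists per column
```

Skipping the missing columns when joining means an NA never turns into a literal "NA" token, and keeping `per_column` alongside `combined` lets you compare a model built on the merged text against models built on each column.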