
I am researching which features I can build for my machine learning model from the data I have. My data contains a lot of text, so I was wondering how to extract valuable features from it. Contrary to what I previously believed, this usually means representing the text with bag-of-words or something like word2vec (http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

Because my understanding of the subject is limited, I don't understand why I can't analyze the text first to get numeric values (for example, TextBlob's sentiment: https://textblob.readthedocs.io/en/dev/, or Google Cloud Natural Language: https://cloud.google.com/natural-language/).

Are there problems with this, or could I use these values as features for my machine learning model?

Thanks in advance for all the help!

Lourens

1 Answer


Of course, you can convert a text input to a single number with sentiment analysis and then use this number as a feature in your machine learning model. There is nothing wrong with this approach.

The question is what kind of information you want to extract from the text data. Sentiment analysis converts a text input into a number between -1 and 1 that represents how negative or positive the text is. For example, you may want the sentiment of customers' comments about a restaurant in order to measure their satisfaction. In this case, it is fine to use sentiment analysis to preprocess the text data.
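To make the idea concrete, here is a minimal sketch of turning comments into a single sentiment column. It does not use TextBlob or Google Cloud; the `polarity` function and its two word lists are made-up stand-ins for a real sentiment analyzer, chosen only so the example runs with the standard library.

```python
# Toy lexicon-based polarity scorer. The word lists are hypothetical and
# far too small for real use; a real analyzer would replace polarity().
POSITIVE = {"good", "great", "delicious", "friendly", "love"}
NEGATIVE = {"bad", "slow", "cold", "rude", "terrible"}


def polarity(text):
    """Return a score in [-1, 1]: (positives - negatives) / matched words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    # Texts with no lexicon words are treated as neutral.
    return 0.0 if matched == 0 else (pos - neg) / matched


comments = [
    "Great food and friendly staff!",
    "The service was slow and the soup was cold.",
    "Good pizza, rude waiter.",
]

# One row per comment, one numeric column: ready to join with other features.
features = [[polarity(c)] for c in comments]
print(features)  # [[1.0], [-1.0], [0.0]]
```

Each comment ends up as one number, so this column can sit next to any other numeric features you already have.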

But again, sentiment analysis only gives an idea of how positive or negative a text is. If you want to cluster text data, sentiment information is not useful, since it says nothing about how similar two texts are. For such tasks, other approaches such as word2vec or bag-of-words are used to represent the text, because they produce a vector representation of the text instead of a single number.
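As a rough illustration of why a vector carries similarity information that a single score does not, here is a small bag-of-words sketch in plain Python. The helper names `bag_of_words` and `cosine` and the three example sentences are my own assumptions; in practice something like scikit-learn's text feature extraction would do this for you.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count vectors) using whitespace tokenization."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for doc in tokenized for w in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


texts = [
    "the pizza was great",   # positive, about pizza
    "the pizza was awful",   # negative, about pizza
    "parking was awful",     # negative, about parking
]
vocab, vectors = bag_of_words(texts)

print(vocab)  # ['awful', 'great', 'parking', 'pizza', 'the', 'was']
print(round(cosine(vectors[0], vectors[1]), 2))  # 0.75
print(round(cosine(vectors[1], vectors[2]), 2))  # 0.58
print(round(cosine(vectors[0], vectors[2]), 2))  # 0.29
```

The two pizza comments have opposite sentiment but come out as the most similar pair, which is the kind of information a clustering algorithm needs and a sentiment score alone cannot provide.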

In conclusion, the approach depends on what kind of information you need to extract from data for your specific task.

  • Thanks for your response! That makes sense. I am building a model that predicts the box office success of movies based on user data from YouTube, Twitter and Facebook. I'd say that sentiment is a valuable feature. Besides that, would clustering text also be applicable in my case? – Lourens Sep 16 '17 at 16:19
  • No, your problem is not a clustering task but a regression or classification task, depending on how you measure success. I think sentiment analysis fits your problem, because if user comments about a movie are positive, the box office is likely to be successful, and vice versa. – Muhammed Hasan Celik Sep 16 '17 at 16:28