I'm trying to classify sentences by topic using machine learning. However, I can't seem to find a suitable algorithm or solution for this particular problem.

Some details:

I have tokenized, lemmatized and vectorized the sentences. So, given a sentence:

How will the weather be today?

It gets tokenized:

['How', 'will', 'the', 'weather', 'be', 'today?']

It then gets lemmatized:

['How', 'weather', 'today']

And then, based on a small dictionary (~100 words) I built, the sentence gets transformed into a sequence of 0s and 1s, signaling whether each word appears in the dictionary or not:

[0, 0, 0, 1, .... 0, 1]
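
The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: `VOCABULARY` and `STOP_WORDS` are hypothetical stand-ins for the ~100-word dictionary and the lemmatization step, and the sketch assumes the output vector has one 0/1 entry per dictionary word.

```python
import string

# Hypothetical mini-vocabulary; the dictionary in the question has ~100 words.
VOCABULARY = ["how", "weather", "rain", "score", "match", "price"]

# Hypothetical stop-word list standing in for the lemmatization/filtering step.
STOP_WORDS = {"will", "the", "be", "is", "a"}


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation from each token."""
    stripped = (word.strip(string.punctuation) for word in sentence.split())
    return [token for token in stripped if token]


def normalize(tokens):
    """Lowercase tokens and drop stop words (a crude stand-in for lemmatization)."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


def vectorize(tokens, vocabulary=VOCABULARY):
    """Return one 0/1 entry per vocabulary word: 1 if it occurs in the sentence."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


tokens = normalize(tokenize("How will the weather be today?"))
vector = vectorize(tokens)
print(tokens)
print(vector)
```

Every sentence then maps to a vector of the same length (the vocabulary size), which is the shape most classifiers expect.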

I've built a small dataset (~50 sentences split into 3 topics), and now I need an algorithm that will train on the dataset and predict one of those 3 classes for a new sentence.

Deep learning is not practical given the small size of the dataset. I've tried linear regression, but it outputs seemingly random, very large numbers. Any ideas on what I should try, or whether I've made any mistakes?

DanBrezeanu

2 Answers

You've pre-processed the data correctly. You neglected to describe the interconnections of the data set, so we have no way of knowing whether 50 observations is sufficient to discriminate the 3 classes.

If your data do contain sufficient information for the task, I expect that a simple bag-of-words approach will give you clusters close to what you want; you can then supply the class labels for training. If that's too "loose" for your application, I recommend letting an SVM solve your problem. A 3-class machine should be able to reduce the component complexity and find good boundaries for your model.
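
As a rough sketch of the SVM suggestion, assuming scikit-learn is available: the vocabulary, the vectors, and the topic names below are invented toy data, not the asker's dataset, and a linear kernel is only one reasonable choice for binary bag-of-words features.

```python
from sklearn.svm import SVC

# Hypothetical 6-word vocabulary: ["weather", "rain", "match", "score", "price", "cheap"].
# Each row is a binary bag-of-words vector for one made-up training sentence.
X_train = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
y_train = ["weather"] * 3 + ["sports"] * 3 + ["shopping"] * 3

# A linear kernel keeps the model simple, which suits a dataset this small.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

# Classify a new sentence that was vectorized with the same vocabulary.
predicted = clf.predict([[1, 1, 0, 0, 0, 0]])[0]
print(predicted)
```

With only ~50 real sentences, it is worth checking the result with cross-validation rather than trusting the training accuracy.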

Finally, if the topics turn on only one or two key words in each class, you might do better with a simple decision tree.
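
A decision tree for the keyword-driven case might look like the sketch below, again assuming scikit-learn. The feature names and sentences are hypothetical; in this toy data each topic is decided by a single key word ("weather", "match", or "price"), and the remaining words are noise.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical vocabulary; the first three words are the per-topic key words.
feature_names = ["weather", "match", "price", "today", "how", "good"]

X_train = [
    [1, 0, 0, 1, 1, 0],  # "how weather today"
    [1, 0, 0, 0, 0, 1],  # "good weather"
    [0, 1, 0, 1, 0, 0],  # "match today"
    [0, 1, 0, 0, 0, 1],  # "good match"
    [0, 0, 1, 1, 0, 0],  # "price today"
    [0, 0, 1, 0, 1, 1],  # "how good price"
]
y_train = ["weather", "weather", "sports", "sports", "shopping", "shopping"]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Print the learned rules so the splits can be inspected by eye.
rules = export_text(tree, feature_names=feature_names)
print(rules)

train_predictions = list(tree.predict(X_train))
```

The printed rules make it easy to see which words the tree actually relies on, which is the main appeal of this approach over an SVM.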

Prune
  • Thanks for the help! I'll try an SVM to see if it solves my problem, given that I'm planning to add more classes in the future. – DanBrezeanu May 15 '19 at 22:01

A good option to try for text classification would be Naive Bayes (if you're willing to assume that the words in each text are conditionally independent given the class - which is a strong assumption, but works a surprising amount of the time!).

The problem is formulated as follows: y is the predicted class for a given data point x, chosen as the class that maximizes the probability of the observed data point under the parameters estimated during training. This is expressed by the right-hand side of the equation (in which C_k is the kth class, and x_i is the ith word in data point x):

$$y = \arg\max_{k} \; p(C_k) \prod_{i} p(x_i \mid C_k)$$

And it can be solved using the standard method outlined here.
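
To make the decision rule concrete, here is a minimal pure-Python sketch that picks the class maximizing the class prior times the product of per-word likelihoods, computed in log space. The training sentences and labels are invented, and the add-one (Laplace) smoothing is an assumption added here, not something stated in the answer.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(sentences, labels):
    """Count sentences per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(sentences, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k)."""
    class_counts, word_counts, vocabulary = model
    total_sentences = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_sentences)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Assumption: add-one smoothing so an unseen (word, class) pair
            # does not force the whole product to zero.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set of already tokenized, lemmatized sentences.
train_sentences = [
    ["how", "weather", "today"],
    ["rain", "tomorrow"],
    ["who", "won", "match"],
    ["match", "score", "today"],
    ["price", "of", "milk"],
    ["cheap", "price", "today"],
]
train_labels = ["weather", "weather", "sports", "sports", "shopping", "shopping"]

model = train_naive_bayes(train_sentences, train_labels)
print(predict(["weather", "tomorrow"], model))
print(predict(["match", "today"], model))
```

Working with sums of logarithms instead of the raw product avoids numerical underflow on longer sentences; scikit-learn's `MultinomialNB` and `BernoulliNB` provide ready-made versions of the same idea.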

zen_of_python