
I am trying to build a prediction (classification) model for a dataset that includes numeric and text features. Using TfidfVectorizer, I have converted the text columns so that each cell in a text column is a list of floats such as [0.0 0.3567 0.0 0.0] (without commas). My target feature is a set of classes; each row can have multiple values, such as

[a, b, c, 1]
[1, d]
[]

The question is: how can I pre-process the target variable so that my model makes classification predictions? I have tried label encoding, but it creates a new encoding for each row, so the same integer is encoded to different classes in different rows.

I am planning to accept, for each row, all the predictions above a certain threshold. Is there a model that supports this as well? Many thanks in advance.

emrahozkan
  • This is a multi-label classification problem. Try [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) on the targets and then [use algorithms from here](http://scikit-learn.org/stable/modules/multiclass.html#multiclass-and-multilabel-algorithms) that support it. – Vivek Kumar Oct 13 '17 at 01:13
  • @VivekKumar so can I simply pass a matrix to fit(x, y) method of the classifier rather than a 1D list ? (as y variable) – emrahozkan Oct 13 '17 at 12:39
  • Yes, that's correct. Please add some sample info for X and y along with your code and we can give you a working example. – Vivek Kumar Oct 13 '17 at 13:36
  • hello, I am almost about to get some results, I will send a part of the code later on it will be more efficient. thanks – emrahozkan Oct 13 '17 at 15:01
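Following the comments above, a minimal sketch of the MultiLabelBinarizer approach with thresholded predictions. The tiny `X` matrix, the choice of `OneVsRestClassifier(LogisticRegression())`, and the 0.3 threshold are all placeholders; in practice `X` would be the TF-IDF feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy targets matching the question: each row is a set of labels
targets = [["a", "b", "c", "1"], ["1", "d"], []]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(targets)  # one indicator column per label
print(mlb.classes_)             # ['1' 'a' 'b' 'c' 'd']
print(y)
# [[1 1 1 1 0]
#  [1 0 0 0 1]
#  [0 0 0 0 0]]

# Placeholder numeric features standing in for the TF-IDF matrix
X = np.array([[0.0, 0.36], [0.5, 0.0], [0.1, 0.9]])

# One binary classifier per label; y is passed as a 2D indicator matrix
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Accept every label whose predicted probability exceeds the threshold
proba = clf.predict_proba(X)
predicted = mlb.inverse_transform((proba >= 0.3).astype(int))
```

`inverse_transform` maps the thresholded indicator matrix back to tuples of original label names, one tuple per row.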

1 Answer


One way is to train a separate binary classifier for each tag (predicting whether a sample has that tag or not). Another is to binarize the tags and train a single model: replace the final softmax (which normalizes the outputs to sum to 1) with an independent sigmoid per tag, and apply a logistic (binary cross-entropy) loss to each tag.

Keras will be pretty easy to use here.
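A minimal sketch of the second idea in Keras; the layer sizes and the input/label dimensions (`n_features`, `n_labels`) are placeholders:

```python
import numpy as np
from tensorflow import keras

n_features, n_labels = 100, 5  # e.g. TF-IDF dimension and number of distinct tags

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),
    # sigmoid instead of softmax: each tag gets an independent probability
    keras.layers.Dense(n_labels, activation="sigmoid"),
])
# binary cross-entropy is the per-tag logistic loss
model.compile(optimizer="adam", loss="binary_crossentropy")

# each output entry is in [0, 1], one probability per tag
probs = model.predict(np.zeros((2, n_features)), verbose=0)
```

Thresholding `probs >= t` then gives the accepted tags per row, which matches the "accept all predictions above a threshold" requirement in the question.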

Alex Ozerov