
I am trying to build a prediction (classification) model for a dataset that includes numeric and text features. Using TfidfVectorizer, I have converted the text columns so that each cell in a text column is a list of floats such as [0.0 0.3567 0.0 0.0] (without commas). My target feature is a set of classes; each row can have multiple values, such as

[a, b, c, 1]
[1, d]
[]

The question is: how can I pre-process the target variable so that my model makes classification predictions? I have tried label encoding, but it creates a new encoding for each row, so the same integer is encoded to different classes in different rows.

I am planning to accept, for each row, all the predictions above a certain threshold. Is there a model that supports this as well? Many thanks in advance.

emrahozkan
  • This is a multi-label classification problem. Try [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) on the targets and then [use algorithms from here](http://scikit-learn.org/stable/modules/multiclass.html#multiclass-and-multilabel-algorithms) that support it. – Vivek Kumar Oct 13 '17 at 01:13
  • @VivekKumar so can I simply pass a matrix to fit(x, y) method of the classifier rather than a 1D list ? (as y variable) – emrahozkan Oct 13 '17 at 12:39
  • Yes, that's correct. Please add some sample info for X and y along with your code and we can give you a working example. – Vivek Kumar Oct 13 '17 at 13:36
  • hello, I am almost about to get some results, I will send a part of the code later on it will be more efficient. thanks – emrahozkan Oct 13 '17 at 15:01
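Following the comments above, a minimal sketch of the MultiLabelBinarizer approach with thresholded predictions. The tiny `X` matrix, the choice of `OneVsRestClassifier(LogisticRegression())`, and the 0.3 threshold are all placeholders; in practice `X` would be the TF-IDF feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy targets matching the question: each row is a set of labels
targets = [["a", "b", "c", "1"], ["1", "d"], []]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(targets)  # one indicator column per label
print(mlb.classes_)             # ['1' 'a' 'b' 'c' 'd']
print(y)
# [[1 1 1 1 0]
#  [1 0 0 0 1]
#  [0 0 0 0 0]]

# Placeholder numeric features standing in for the TF-IDF matrix
X = np.array([[0.0, 0.36], [0.5, 0.0], [0.1, 0.9]])

# One binary classifier per label; y is passed as a 2D indicator matrix
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Accept every label whose predicted probability exceeds the threshold
proba = clf.predict_proba(X)
predicted = mlb.inverse_transform((proba >= 0.3).astype(int))
```

`inverse_transform` maps the thresholded indicator matrix back to tuples of original label names, one tuple per row.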

1 Answer


One way is to train a separate binary classifier for each tag (predicting whether a sample has that tag or not). Another is to binarize the tags and train a single model: replace the final softmax (which normalizes the outputs to sum to 1) with an independent sigmoid per tag, and apply a logistic (binary cross-entropy) loss to each tag.

Keras will be pretty easy to use here.
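A minimal sketch of the second idea in Keras; the layer sizes and the input/label dimensions (`n_features`, `n_labels`) are placeholders:

```python
import numpy as np
from tensorflow import keras

n_features, n_labels = 100, 5  # e.g. TF-IDF dimension and number of distinct tags

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),
    # sigmoid instead of softmax: each tag gets an independent probability
    keras.layers.Dense(n_labels, activation="sigmoid"),
])
# binary cross-entropy is the per-tag logistic loss
model.compile(optimizer="adam", loss="binary_crossentropy")

# each output entry is in [0, 1], one probability per tag
probs = model.predict(np.zeros((2, n_features)), verbose=0)
```

Thresholding `probs >= t` then gives the accepted tags per row, which matches the "accept all predictions above a threshold" requirement in the question.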

Alex Ozerov