I'm new to machine learning and I'm having trouble adapting the examples I've found to my specific problem. The official scikit-learn documentation is rather spartan and full of terminology I'm unfamiliar with, so I'm not really sure which algorithm I should be using, how to properly prepare my data for it, or how to get the predictions in the form I want.
I already have my feature extraction function for the text in place, which returns a tuple of floats ranging from 0.0 to 100.0. These represent the prevalence of a certain characteristic in the text as a percentage, so my features for a certain piece of text would look something like (0.0, 17.31, 57.0, 93.2, ...). I'm unsure which algorithm would be most suitable for this type of data.
As per the title, I also need the ability to classify a piece of text with more than one label. Reading some other SO questions clued me in that I need MultiLabelBinarizer and OneVsRestClassifier, but I'm still unsure how to apply them to my data and to whichever algorithm I end up using.
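From what I could piece together, preparing the labels might go something like this — a minimal sketch of my current understanding of MultiLabelBinarizer, which may well be off:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each sample's label list becomes a 0/1 row, one column per known
# label; the columns follow the sorted label names in mlb.classes_
binarized = mlb.fit_transform([["spam"], ["code"], ["code", "urlList"]])
print(mlb.classes_)  # columns: code, spam, urlList
print(binarized)     # rows line up with the input samples

# inverse_transform goes back from 0/1 rows to tuples of label names
print(mlb.inverse_transform(binarized))
```

If I understood correctly, inverse_transform is the way to resolve the binarized rows back to their string counterparts.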
I also didn't find any examples that return prediction results for multiple labels in the form I want. That is, instead of a binary "is or isn't this label", I'd like a percentage chance that the text has a certain label. So when doing something like classifier.predict(testData), I'd like a return value like {"spam":87.3, "code":27.9, "urlList":3.12} instead of something like ["spam", "code", "urlList"]. That way I can make more precise decisions about what to do with a certain text.
I should probably also mention one characteristic of my dataset: 85-90% of the texts will be code, and will therefore have only the one tag, "code". I imagine some tweaks to the algorithm are required to account for this imbalance?
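The only tweak I've stumbled on so far is the class_weight option, so I'm guessing something along these lines — a sketch that assumes LogisticRegression as the base estimator, which is purely a placeholder since I don't know the right algorithm:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# "balanced" reweights samples inversely to label frequency, so the
# dominant "code" label shouldn't completely drown out the rare ones
base = LogisticRegression(class_weight="balanced", max_iter=1000)
classifier = OneVsRestClassifier(base)
# classifier.fit(trainData, fitTrainLabels) would then go here as before
```

No idea if that's the right knob, or whether I should be resampling the data instead.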
Some simplified and probably unsuitable code:
from sklearn import preprocessing
from sklearn.multiclass import OneVsRestClassifier

possibleLabels = ["code", "spam", "urlList"]
trainData = [ (0.0, 17.31, 57.0, 93.2), ... ]
trainLabels = [ ["spam"], ["code"], ["code", "urlList"], ... ]
testData, testLabels = [], [] # Separate batch of samples in the same format as above

# Not sure if this is the proper way to prepare my labels,
# nor how to later resolve the binarized versions to their string counterparts.
mlb = preprocessing.MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)

# Feels like I need more to make it suitable for my data --
# OneVsRestClassifier apparently wants a base estimator, but I don't know which.
classifier = OneVsRestClassifier()
classifier.fit(trainData, fitTrainLabels)

# Need the return as a list of dicts containing the probability of each tag,
# i.e. [ {"spam":87.3, "code":27.9, "urlList":3.12}, {...}, ... ]
predicted = classifier.predict(testData)
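And for reference, here's my rough end-to-end attempt at producing the dict form I described. It assumes the base estimator (LogisticRegression again, purely as a placeholder) exposes predict_proba through OneVsRestClassifier, and the tiny made-up dataset is just to show the shapes:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

trainData = [(0.0, 17.31), (57.0, 93.2), (88.0, 2.5)]
trainLabels = [["spam"], ["code"], ["code", "urlList"]]

mlb = MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)

classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
classifier.fit(trainData, fitTrainLabels)

# predict_proba returns one probability column per label, in the same
# order as mlb.classes_, so zipping the two rebuilds the label names
probabilities = classifier.predict_proba([(60.0, 90.0)])
predicted = [
    {label: round(prob * 100, 2) for label, prob in zip(mlb.classes_, row)}
    for row in probabilities
]
print(predicted)  # one dict per test sample, with a percentage per label
```

Is zipping with mlb.classes_ like this actually the intended way to map the columns back to the label names?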