I'm new to machine learning and I'm having trouble adapting the examples I've found to my specific problem. The official scikit-learn documentation is rather spartan and full of terminology I'm unfamiliar with, so I'm not really sure which algorithm I should be using, how to properly prepare my data for it, or how to get the predictions in the form I want.
I already have my feature extraction function for the text in place, which returns a tuple of floats ranging from 0.0 to 100.0. These represent the prevalence of a certain characteristic in the text as a percentage, so my features for a certain piece of text would look something like (0.0, 17.31, 57.0, 93.2, ...). I'm unsure which algorithm would be most suitable for this type of data.
As per the title, I also need the ability to classify a piece of text with more than one label. Reading some other SO questions clued me in that I need MultiLabelBinarizer and OneVsRestClassifier, but I'm still unsure how to apply them to my data and to whichever algorithm I end up using.
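From what I could piece together, preparing the labels might go something like this — a minimal sketch of my current understanding of MultiLabelBinarizer, which may well be off:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each sample's label list becomes a 0/1 row, one column per known
# label; the columns follow the sorted label names in mlb.classes_
binarized = mlb.fit_transform([["spam"], ["code"], ["code", "urlList"]])
print(mlb.classes_)  # columns: code, spam, urlList
print(binarized)     # rows line up with the input samples

# inverse_transform goes back from 0/1 rows to tuples of label names
print(mlb.inverse_transform(binarized))
```

If I understood correctly, inverse_transform is the way to resolve the binarized rows back to their string counterparts.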
I also didn't find any examples that return prediction results for multiple labels in the form I want. That is, instead of a binary "is or isn't this label", I'd like a percentage chance that the text has a certain label. So when doing something like classifier.predict(testData), I'd like a return value like {"spam":87.3, "code":27.9, "urlList":3.12} instead of something like ["spam", "code", "urlList"]. That way I can make more precise decisions about what to do with a certain text.
I should probably also mention one characteristic of my dataset: 85-90% of the texts will be code, and will therefore have only the one tag, "code". I imagine some tweaks to the algorithm are required to account for this imbalance?
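The only tweak I've stumbled on so far is the class_weight option, so I'm guessing something along these lines — a sketch that assumes LogisticRegression as the base estimator, which is purely a placeholder since I don't know the right algorithm:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# "balanced" reweights samples inversely to label frequency, so the
# dominant "code" label shouldn't completely drown out the rare ones
base = LogisticRegression(class_weight="balanced", max_iter=1000)
classifier = OneVsRestClassifier(base)
# classifier.fit(trainData, fitTrainLabels) would then go here as before
```

No idea if that's the right knob, or whether I should be resampling the data instead.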
Some simplified and probably unsuitable code:
from sklearn import preprocessing
from sklearn.multiclass import OneVsRestClassifier

possibleLabels = ["code", "spam", "urlList"]
trainData = [ (0.0, 17.31, 57.0, 93.2), ... ]
trainLabels = [ ["spam"], ["code"], ["code", "urlList"], ... ]
testData, testLabels = [], [] # Separate batch of samples in the same format as above

# Not sure if this is the proper way to prepare my labels,
# nor how to later resolve the binarized versions to their string counterparts.
mlb = preprocessing.MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)

# Feels like I need more to make it suitable for my data --
# OneVsRestClassifier apparently wants a base estimator, but I don't know which.
classifier = OneVsRestClassifier()
classifier.fit(trainData, fitTrainLabels)

# Need the return as a list of dicts containing the probability of each tag,
# i.e. [ {"spam":87.3, "code":27.9, "urlList":3.12}, {...}, ... ]
predicted = classifier.predict(testData)
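And for reference, here's my rough end-to-end attempt at producing the dict form I described. It assumes the base estimator (LogisticRegression again, purely as a placeholder) exposes predict_proba through OneVsRestClassifier, and the tiny made-up dataset is just to show the shapes:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

trainData = [(0.0, 17.31), (57.0, 93.2), (88.0, 2.5)]
trainLabels = [["spam"], ["code"], ["code", "urlList"]]

mlb = MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)

classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
classifier.fit(trainData, fitTrainLabels)

# predict_proba returns one probability column per label, in the same
# order as mlb.classes_, so zipping the two rebuilds the label names
probabilities = classifier.predict_proba([(60.0, 90.0)])
predicted = [
    {label: round(prob * 100, 2) for label, prob in zip(mlb.classes_, row)}
    for row in probabilities
]
print(predicted)  # one dict per test sample, with a percentage per label
```

Is zipping with mlb.classes_ like this actually the intended way to map the columns back to the label names?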