Let's say I've created a model with ~30 training examples for each of 10 categories, taking all of the defaults that were provided to me.
The Average F1 Score for the model is 0.875 (I have 2 categories that are very closely related, so that's hurting accuracy a bit).
If I do a real-time prediction for a piece of text that should match positively for categories 3 and 8, I get this result:
```json
{
    "Prediction": {
        "details": {
            "Algorithm": "SGD",
            "PredictiveModelType": "MULTICLASS"
        },
        "predictedLabel": "8",
        "predictedScores": {
            "1": 0.002642059000208974,
            "2": 0.010648942552506924,
            "3": 0.41401588916778564,
            "4": 0.02918998710811138,
            "5": 0.008376320824027061,
            "6": 0.009010250680148602,
            "7": 0.006029266398400068,
            "8": 0.4628857374191284,
            "9": 0.04102163389325142,
            "10": 0.01617990992963314
        }
    }
}
```
What I'm wondering is whether 3 and 8 each had effectively ~80% certainty on their own, but because both matched, that certainty was split between the two. If you sum all of the predictedScores, you get 0.999999997, which has me questioning whether there's a total score of 1.0 that gets split amongst the available categories...
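For reference, this is the quick check behind that number; it just sums the predictedScores values from the response above:

```python
# Sum the predictedScores returned in the real-time prediction above.
predicted_scores = {
    "1": 0.002642059000208974,
    "2": 0.010648942552506924,
    "3": 0.41401588916778564,
    "4": 0.02918998710811138,
    "5": 0.008376320824027061,
    "6": 0.009010250680148602,
    "7": 0.006029266398400068,
    "8": 0.4628857374191284,
    "9": 0.04102163389325142,
    "10": 0.01617990992963314,
}

total = sum(predicted_scores.values())
print(total)  # ~0.999999997, i.e. effectively 1.0
```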
If I instead set up 10 separate binary models and ran the prediction against each of them independently, would 3 and 8 score higher (e.g. something closer to 0.8)?
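To make that alternative concrete, this is roughly what I have in mind; it's only a sketch using boto3's `predict` call, and the model IDs, the `"text"` attribute name, and the endpoint URL are placeholders rather than a real setup:

```python
import boto3

client = boto3.client("machinelearning")

# Hypothetical IDs for ten independent binary models, one per category.
binary_model_ids = {
    "1": "ml-category-1-example",
    "2": "ml-category-2-example",
    # ...and so on for the remaining categories...
    "10": "ml-category-10-example",
}

text_to_classify = "a piece of text that should match categories 3 and 8"

scores = {}
for category, model_id in binary_model_ids.items():
    response = client.predict(
        MLModelId=model_id,
        Record={"text": text_to_classify},  # attribute name depends on your data schema
        PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
    )
    # A binary model returns a single score, read here as the probability
    # of the positive class for that category.
    scores[category] = list(response["Prediction"]["predictedScores"].values())[0]

print(scores)  # e.g. would categories 3 and 8 both come back near 0.8?
```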
A related question, which I don't really need answered but which might help clarify the overall question: if I had a theoretical piece of text that definitely fit all 10 categories, could Amazon Machine Learning respond with a predictedScore of 1.0 for each category? Or, because the scores appear to be split out of a total of 1.0, would it return 0.1 for each category?