
I am querying the Azure Custom Vision V3.0 Training API (see https://westeurope.dev.cognitive.microsoft.com/docs/services/Custom_Vision_Training_3.0/operations/5c771cdcbf6a2b18a0c3b809) so I can generate per-tag ROCs myself via the GetIterationPerformance operation, part of whose output is:

{u'averagePrecision': 0.92868346,
 u'perTagPerformance': [{u'averagePrecision': 0.4887446,
                         u'id': u'uuid1',
                         u'name': u'tag_name_1',
                         u'precision': 0.0,
                         u'precisionStdDeviation': 0.0,
                         u'recall': 0.0,
                         u'recallStdDeviation': 0.0},
                        {u'averagePrecision': 1.0,
                         u'id': u'uuid2',
                         u'name': u'tag_name_2',
                         u'precision': 0.0,
                         u'precisionStdDeviation': 0.0,
                         u'recall': 0.0,
                         u'recallStdDeviation': 0.0},
                        {u'averagePrecision': 0.9828302,
                         u'id': u'uuid3',
                         u'name': u'tag_name_3',
                         u'precision': 1.0,
                         u'precisionStdDeviation': 0.0,
                         u'recall': 0.5555556,
                         u'recallStdDeviation': 0.0}],
 u'precision': 0.9859485,
 u'precisionStdDeviation': 0.0,
 u'recall': 0.3752228,
 u'recallStdDeviation': 0.0}

The precision and recall uncertainties, precisionStdDeviation and recallStdDeviation respectively, always seem to be 0.0. Is this user error, and if not, are there any plans to activate these stats?
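For reference, this is roughly how I retrieve that payload (a minimal sketch using requests against the REST endpoint; the project ID, iteration ID, key and thresholds below are placeholders for my real values):

import pprint
import requests

# Placeholders -- substitute your own resource region, project, iteration and key.
ENDPOINT = "https://westeurope.api.cognitive.microsoft.com"
PROJECT_ID = "your-project-uuid"
ITERATION_ID = "your-iteration-uuid"
TRAINING_KEY = "your-training-key"

url = (ENDPOINT + "/customvision/v3.0/Training/projects/" + PROJECT_ID +
       "/iterations/" + ITERATION_ID + "/performance")
params = {"threshold": 0.5, "overlapThreshold": 0.3}
response = requests.get(url, headers={"Training-Key": TRAINING_KEY}, params=params)
response.raise_for_status()
pprint.pprint(response.json())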

jtlz2

1 Answer


Currently precisionStdDeviation and recallStdDeviation are not used, so they will always be zero; it is not user error. These two metrics exist because previously we ran cross validation on the user's dataset: each cross-validation fold produced its own precision and recall, and the stddev measured the variation of precision and recall across the folds. Now, instead of cross validation, we split off a proportion of the user data as a validation set and report IterationPerformance based on that; since there are no longer multiple folds, the stddev will always be zero. We plan to retire these two fields, sorry for the confusion; they will very likely be removed in the next major version.
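To make the difference concrete, here is a toy sketch (the numbers are purely illustrative and this is not how our pipeline is implemented): several cross-validation folds give a spread to report, while a single validation split gives a standard deviation of zero by construction.

from statistics import mean, pstdev

# Several folds -> several precision measurements -> a non-zero spread (illustrative numbers).
fold_precisions = [0.97, 0.99, 0.96]
print(mean(fold_precisions), pstdev(fold_precisions))

# One held-out validation split -> one measurement -> stddev is 0.0 by construction.
split_precision = [0.9859485]
print(mean(split_precision), pstdev(split_precision))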

Kuan Lu
  • Thanks for the speedy answer! Could you add some extra detail on the other metrics in the payload, and plans for them, and then I'm good to go? :) Specifically: What's the difference between `precision` and `averagePrecision` etc. in this case? Thanks! – jtlz2 May 31 '19 at 15:03
  • Glad to answer that. 1. Plans for the payload: we will retire the `StdDeviation` fields and the rest will stay the same. 2. For the metrics in the payload, we calculate precision and recall by definition: Precision = TP/(TP+FP), Recall = TP/(TP+FN), where TP, FP and FN depend on your choice of threshold (the third parameter in the GetIterationPerformance API). Say we have an apple image and the model predicts it's an apple with 60% confidence: if you set the threshold above 0.6 this sample will be a False Negative, and if below 0.6 it will be a True Positive. – Kuan Lu May 31 '19 at 20:31
  • Average Precision (AP), however, is a summary of precision and recall at different thresholds; a detailed explanation can be found [here](https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173). So with AP you get an idea of how well the model is performing overall: a model with high precision and recall across different thresholds will have a higher AP. Another thing you will notice if you change the threshold around is that the AP stays the same, as AP is not tied to a single threshold but to all thresholds as a whole. – Kuan Lu May 31 '19 at 20:32
  • Any chance you or a colleague might be able to assist with this Q too? https://stackoverflow.com/questions/56426097/azure-cognitive-services-custom-vision-how-do-i-design-an-appropriate-multi-lab :) – jtlz2 Jun 03 '19 at 11:10
  • Yes, I see that AP is not a function of the requested `threshold`. Also by Average I assume you mean Mean rather than Median or Mode :) Are there any plans to enrich the metrics available? We are somewhat in your hands at the present time.... Thanks again! – jtlz2 Jun 03 '19 at 12:00
  • Yes, AP works kind of like mean precision at different thresholds. Are there any additional metrics that you need in particular or that would make sense to you? We are always open to feedback, so perhaps we can add some more. – Kuan Lu Jun 03 '19 at 17:31
  • Sure, we will take a look at that Q. – Kuan Lu Jun 03 '19 at 17:33
  • Many thanks. Would it be possible to get in touch directly via email, and if so, could you please track me down on the web to discuss? I'm not sure if we are allowed to exchange details on here. I have lots of questions, but it's really hard to get in touch with the people who know, such as yourself. Huge thanks in advance. – jtlz2 Jun 04 '19 at 08:19
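To illustrate the threshold behaviour described in the comments above, here is a small sketch with hypothetical confidences and labels: it computes precision and recall at one threshold, and an AP over all thresholds via scikit-learn, which may differ in detail from what Custom Vision computes internally.

from sklearn.metrics import average_precision_score, precision_score, recall_score

# Hypothetical per-image confidences for one tag, with ground-truth labels.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_score = [0.95, 0.60, 0.40, 0.30, 0.75, 0.85, 0.10, 0.55]

# Precision = TP/(TP+FP) and Recall = TP/(TP+FN) depend on the chosen threshold.
threshold = 0.5
y_pred = [1 if score >= threshold else 0 for score in y_score]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))

# AP summarises precision/recall over all thresholds, so varying `threshold`
# above leaves it unchanged.
print(average_precision_score(y_true, y_score))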