When using the Google Prediction API (v1.6) for classification, I get different behavior when using "insert" to train the model versus "update".

If I upload a CSV file to storage and train (insert) from it, or use the insert method and include the training data in the request, the result is the same (i.e. which insert method I use doesn't matter).

However, creating an empty model via insert and then adding all the data via updates yields a different result.
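
For reference, the three workflows differ only in how the training rows reach the API. Below is a minimal sketch of the request bodies involved. It assumes the v1.6 field names (`id`, `storageDataLocation`, `trainingInstances`, `csvInstance`, `output`) as I understand them from the API reference, so check them against the docs. The helper names, model id and bucket path are made up for illustration, and nothing here calls the service:

```python
# Sketch only: builds request bodies as plain dicts, makes no API calls.
# Field names are assumed from the Prediction API v1.6 reference.

def insert_body_from_storage(model_id, storage_path):
    """Body for an insert that trains from a CSV already in Cloud Storage."""
    return {"id": model_id, "storageDataLocation": storage_path}

def insert_body_inline(model_id, rows):
    """Body for an insert that carries the training rows in the request."""
    return {
        "id": model_id,
        "trainingInstances": [
            {"output": label, "csvInstance": [text]} for label, text in rows
        ],
    }

def insert_body_empty(model_id):
    """Body for an insert that creates a model with no training data."""
    return {"id": model_id}

def update_bodies(rows):
    """One update body per training row, sent after the empty insert."""
    return [{"output": label, "csvInstance": [text]} for label, text in rows]

# Hypothetical rows in (label, text) form.
rows = [("Address 1", "Addr1"), ("Address 2", "Addr2")]
inline = insert_body_inline("my-model", rows)
updates = update_bodies(rows)
```

The first two bodies are the ones that give me identical results; the empty insert followed by one update per row is the path that behaves differently.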

The prediction probabilities are very different, and a model created via insert doesn't seem to be affected by updates made after the initial training.

Using insert, the prediction probabilities for "Addr12" are:
Predicting: Addr12
Prob: 0.071895 Label: Logon Name
Prob: 0.039216 Label: State
Prob: 0.000000 Label: Logon Type
Prob: 0.013072 Label: SSN
Prob: 0.052288 Label: Employee Number
Prob: 0.032680 Label: First Name
Prob: 0.071895 Label: Middle Name
Prob: 0.052288 Label: Last Name
Prob: 0.071895 Label: Date Of Birth
Prob: 0.098039 Label: Gender
Prob: 0.006536 Label: Eligibility Class
Prob: 0.019608 Label: Location
Prob: 0.104575 Label: Address 1
Prob: 0.111111 Label: Address 2
Prob: 0.026144 Label: City
Prob: 0.058824 Label: Zip
Prob: 0.091503 Label: Date Of Hire
Prob: 0.078431 Label: Hours Worked Per Week

Using update, the prediction probabilities for "Addr12" are:
Predicting: Addr12
Prob: 0.000000 Label: Hours Worked Per Week
Prob: 0.000000 Label: Date Of Hire
Prob: 0.000000 Label: Zip
Prob: 0.000000 Label: State
Prob: 0.000000 Label: City
Prob: 0.527513 Label: Address 2
Prob: 0.472487 Label: Address 1
Prob: 0.000000 Label: Location
Prob: 0.000000 Label: Eligibility Class
Prob: 0.000000 Label: Gender
Prob: 0.000000 Label: Date Of Birth
Prob: 0.000000 Label: Last Name
Prob: 0.000000 Label: Middle Name
Prob: 0.000000 Label: First Name
Prob: 0.000000 Label: Employee Number
Prob: 0.000000 Label: SSN
Prob: 0.000000 Label: Logon Type
Prob: 0.000000 Label: Logon Name

Lastly, the output of Analyze after using insert contains dataDescription/outputFeature/text plus the modelDescription and confusionMatrix. The output of Analyze after using update doesn't contain the modelDescription or confusionMatrix (and no, I'm not simply excluding those fields from the output).

Has anybody had success using insert to train an initial model and then using update to improve it?

----- Ed

  • Digging in a bit more: after using insert to train initially and then adding more data via update, Analyze sees the new data (dataDescription.outputFeature.text and features get updated), but the confusion matrix never changes. Ever. Even when new labels are added. – DigitalEd Jul 14 '14 at 20:01
  • Another interesting discovery. The training data I had was already grouped by category. I took the training file and shuffled the entries, and the behavior of the system changed. The updates do seem to work. – DigitalEd Jul 15 '14 at 14:55
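
For anyone wanting to try the same shuffle, here is a minimal sketch of one way to do it. It assumes the training file is a plain CSV with one example per row and no header row. The function names and file paths are hypothetical, and this is not necessarily how the file above was shuffled:

```python
import csv
import random

def shuffle_rows(rows, seed=None):
    """Return a shuffled copy of rows; the input list is left untouched."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def shuffle_csv(src_path, dst_path, seed=None):
    """Read every row of src_path and write them to dst_path in random order.

    Assumes there is no header row, so every line is treated as a
    training example.
    """
    with open(src_path, newline="") as src:
        rows = list(csv.reader(src))
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(shuffle_rows(rows, seed))

# Hypothetical usage:
# shuffle_csv("training_grouped.csv", "training_shuffled.csv", seed=42)
```

Passing a fixed `seed` makes the ordering reproducible, which helps when comparing insert and update runs on the same shuffled file.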
