
I am working on a project involving CreateML and an MLLinearRegressor. For some reason, any time I attempt to predict a value that's not present in the training data, I get the same prediction every time. This happens both in Swift Playgrounds and when using the model in an Xcode project. Why might this be happening? I've posted my Swift Playgrounds code below.

import CreateML
import CoreML
import Foundation

do {
    let data: [String: MLDataValueConvertible] = [
        "Processor Name": ["A6", "A7", "A8", "A8X", "A9", "A9X", "A10X", "A10X", "A11"],
        "Geekbench Singlecore": [754, 1325, 1660, 1796, 2522, 3052, 3463, 3909, 4219]
    ]

    let CPURegressor = try MLLinearRegressor(trainingData: MLDataTable(dictionary: data), targetColumn: "Geekbench Singlecore", featureColumns: ["Processor Name"])

    let testData: [String: MLDataValueConvertible] = [
        "Processor Name": ["A6", "A7", "A8", "A8X", "A9", "A9X", "A10X", "A10X", "A11", "A12"],
        "Geekbench Singlecore": [754, 1325, 1660, 1796, 2522, 3052, 3463, 3909, 4219,0]
    ]

    print(try CPURegressor.predictions(from: MLDataTable(dictionary: testData))) // Notice how last (A12) and first (A6) values are the same
} catch {
    print(error)
}

Update: This is what my code looks like after switching the Processor Name column to numeric values:

import CreateML
import CoreML
import Foundation

do {
    let data: [String: MLDataValueConvertible] = [
        "Processor Name": [6.0, 7.0, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0],
        "Geekbench Singlecore": [754, 1325, 1660, 1796, 2522, 3052, 3463, 3909, 4219]
    ]

    print(try MLDataTable(dictionary: data))
    let CPURegressor = try MLRegressor(trainingData: MLDataTable(dictionary: data), targetColumn: "Geekbench Singlecore", featureColumns: ["Processor Name"])
    /*, parameters: MLBoostedTreeRegressor.ModelParameters(validationData: nil, maxDepth: 1000, maxIterations: 1000, minLossReduction: 1))*/
    /*CPURegressor.modelParameters = MLImageClassifier.ModelParameters(featureExtractor: .scenePrint(revision: 1),
                                                                     validationData: nil,
                                                                     maxIterations: 30,
                                                                     augmentationOptions: [])*/

  /*  let testData: [String: MLDataValueConvertible] = [
        "Processor Name": [0, 1, 2, 3, 4, 5, 6, 7, 8, 14],
        "Geekbench Singlecore": [1325, 1660, 1796, 2522, 3052, 3463, 3909, 4219,0, 1325]
    ]

    print(try CPURegressor.predictions(from: MLDataTable(dictionary: testData))) // Notice how last (A12) and first (A6) values are the same*/
} catch {
    print(error)
}
Jake3231

1 Answer


Linear regression computes an output value for a given input value, and both have to be numeric. Your input values are not numeric; they are strings. So how does the linear regression know what "A12" is compared to all the other input values?

To a human it makes sense that A12 comes after A11, but since these values are not numeric, the linear regression has to turn them into numbers somehow, and there's no way of telling how it will do this. So it's impossible to say where A12 lies on the "number line" (or where any of the other processors lie on that line).

In other words, you're using a categorical value as input to the linear regression, while linear regression can only handle real-valued inputs.
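To make the problem concrete, here is a minimal sketch of one common way to turn categories into numbers, one-hot encoding. The `oneHot` helper is hypothetical and written only for illustration; this is an assumption about how such an encoding could work, not a description of what Create ML does internally.

```swift
// Hypothetical one-hot encoder, for illustration only.
// Assumption: this is NOT taken from Create ML; it just shows one
// plausible way a string category could be mapped to numbers.
func oneHot(_ value: String, categories: [String]) -> [Double] {
    // One slot per known category: 1.0 where the name matches, else 0.0.
    return categories.map { $0 == value ? 1.0 : 0.0 }
}

// Distinct processor names from the question's training data.
let categories = ["A6", "A7", "A8", "A8X", "A9", "A9X", "A10X", "A11"]

let a7 = oneHot("A7", categories: categories)
let a12 = oneHot("A12", categories: categories)

print(a7)   // a single 1.0, in the "A7" slot
print(a12)  // only zeros, because "A12" was never seen in training
```

Under an encoding like this, an unseen name carries no information about where it sits relative to the others, so a linear model could only fall back on a constant value. That would be consistent with getting the same prediction for every processor that is missing from the training data.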

Try replacing the "Processor Name" values with [0, 1, 2, 3, 4, 5, 6, 7, 8] instead, then ask for the prediction for 9, which would be the A12 processor. (This does not necessarily make sense, though: it assumes that the difference between each processor generation is exactly 1, and it is not clear what that would mean.)
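As a rough sketch of what the numeric version amounts to, the following fits a straight line to the indices 0 through 8 with ordinary least squares in plain Swift and extrapolates to 9. The `fitLine` helper is hypothetical and hand-rolled for illustration; MLLinearRegressor may use different settings internally, so its prediction will not necessarily match this number.

```swift
// Hypothetical helper: ordinary least-squares fit of y = slope * x + intercept.
// Assumption: a plain hand-written fit, not the Create ML implementation.
func fitLine(x: [Double], y: [Double]) -> (slope: Double, intercept: Double) {
    precondition(x.count == y.count && x.count > 1, "need matching inputs with at least two points")
    let n = Double(x.count)
    let meanX = x.reduce(0, +) / n
    let meanY = y.reduce(0, +) / n

    var sxy = 0.0  // sum of (x - meanX) * (y - meanY)
    var sxx = 0.0  // sum of (x - meanX)^2
    for (xi, yi) in zip(x, y) {
        sxy += (xi - meanX) * (yi - meanY)
        sxx += (xi - meanX) * (xi - meanX)
    }

    let slope = sxy / sxx
    return (slope: slope, intercept: meanY - slope * meanX)
}

// Processor generations replaced by the indices suggested above.
let generation: [Double] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
let scores: [Double] = [754, 1325, 1660, 1796, 2522, 3052, 3463, 3909, 4219]

let line = fitLine(x: generation, y: scores)

// Index 9 stands in for the A12, which is not in the training data.
let a12Estimate = line.slope * 9 + line.intercept
print(a12Estimate)
```

Because the input is now a real number, 9 lands on the same line as the training points and gets its own extrapolated value instead of a repeated one.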

Also, you have A10X in your data twice.

Matthijs Hollemans
  • Thank you! I’ll certainly try switching out the processor names, then asking for the prediction for 9. I’ve had the same issue when using an MLBoostedTreeRegressor as well. Do you think this will solve the issue there too? – Jake3231 Jul 30 '18 at 11:42
  • I have also had the same issue when using numbers and an MLBoostedTreeRegressor. Do you know what could be happening in this case? – Jake3231 Jul 30 '18 at 14:07
  • I'd need to see your actual code before I can answer that. – Matthijs Hollemans Jul 30 '18 at 19:09