Microsoft.ML rel 0.8.0 MLContext with a text file with different data types

Question

I am trying to use ML.NET with a CSV file. The file contains:

Price data (decimal)
Enums (different types, stored as strings)
Statistical data (float)

I'm trying to follow the sample provided in MlNetCookBook, but I can't use

FeatureVector: r.DecimalField1.ConcatWith(r.DecimalField2, r.EnumType1, r.EnumType2, r.FloatField1,...)

because the types are not compatible, and I would like to "Dictionarizer" the enum values.

Does anyone know how this should be configured with the new API?

Thanks

You should probably OneHotEncode your enum fields beforehand. – ClojureMostly, Nov 02 '18 at 07:49
In the new API this would be a .ToKey(), I think, but I'm not sure where to "stick" it. If I do .Append(r => r.RTH.ToKey()), the whole thing gets underlined in red... – Walter Verhoeven, Nov 02 '18 at 10:23

score 1 · Accepted Answer · answered Nov 02 '18 at 20:01

Dictionarizer()/ToKey() are useful for dealing with string labels in classification problems. The output is of type "Key" which cannot be concatenated with the numeric features that you have.

For the categorical (enum) features, you'll probably want to use OneHotEncoding as @ClojureMostly mentioned: r.RTH.OneHotEncoding(). This will output a vector of floats which can then be concatenated with the other numeric features you have.
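To make the idea concrete, here is a minimal plain C# sketch of what one-hot encoding and concatenation do conceptually. It deliberately does not use the ML.NET API; the helper names `OneHot` and `Concat`, the sample values, and the "ETH" category are hypothetical and only serve as an illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OneHotSketch
{
    // Maps a category to a float vector with a single 1 at the category's index.
    // An unknown value yields an all-zero vector.
    public static float[] OneHot(string value, IReadOnlyList<string> categories)
    {
        var vector = new float[categories.Count];
        for (int i = 0; i < categories.Count; i++)
        {
            if (categories[i] == value)
            {
                vector[i] = 1f;
            }
        }
        return vector;
    }

    // Joins several float arrays into one flat feature vector.
    public static float[] Concat(params float[][] parts) =>
        parts.SelectMany(p => p).ToArray();

    public static void Main()
    {
        // Hypothetical row: a decimal price, an enum stored as text, a float statistic.
        decimal price = 101.25m;
        string session = "RTH";
        float statistic = 0.42f;

        // Hypothetical set of enum values.
        var sessions = new[] { "RTH", "ETH" };

        float[] features = Concat(
            new[] { (float)price },      // cast so the value fits in a float vector
            OneHot(session, sessions),   // the enum becomes a vector of floats
            new[] { statistic });

        Console.WriteLine(string.Join(", ", features));
    }
}
```

Once every column is a float or a float vector, all of them share one element type, which is why they can be joined into a single feature vector.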

If you are still seeing an error, would you be able to share your TextLoader and your estimator pipeline?

Gal Oshri

Hi, this did it: I added the encoding and the errors are gone. My mistake was assuming I needed a separate .Append for this; I just inlined it with the rest of them. – Walter Verhoeven Nov 02 '18 at 22:05
Now I get "Training set has 0 instances, aborting training." There are 165 rows in my unit test. Is this a normalisation error? – Walter Verhoeven Nov 05 '18 at 10:53
Were there any other changes? Did you add any filters that might have removed rows from the dataset? – Gal Oshri Nov 06 '18 at 20:47
The only issue that I see is that I have more than one data type (enum, float, decimal). Normalisation seems to be the issue, even if I merge them later in the pipeline. – Walter Verhoeven Nov 08 '18 at 12:51
Does the issue only happen when you add normalization? Normalization should be fine for these data types once they are processed to a numeric vector and concatenated. – Gal Oshri Nov 09 '18 at 23:40
One thing that might cause "Training set has 0 instances, aborting training" is if you are using the default separator in the TextReader (tabs) but you have a csv. Make sure to include `separator: ','` – Gal Oshri Nov 09 '18 at 23:41
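A quick way to check the separator point above is to split one line of the file by hand. This is only a plain C# sketch, not ML.NET code; `CountFields` and the sample row are hypothetical.

```csharp
using System;

public static class SeparatorCheck
{
    // Counts the fields obtained by splitting a line on the given separator.
    public static int CountFields(string line, char separator) =>
        line.Split(separator).Length;

    public static void Main()
    {
        // Hypothetical CSV row: price, enum as text, statistic.
        string line = "101.25,RTH,0.42";

        Console.WriteLine($"tab:   {CountFields(line, '\t')} field(s)"); // tab:   1 field(s)
        Console.WriteLine($"comma: {CountFields(line, ',')} field(s)"); // comma: 3 field(s)
    }
}
```

Split on tabs, the comma-separated row stays a single field, so the expected columns are never found. In a real check, the sample string would be replaced by the first data line of the actual file.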