ML.NET skipping columns from datasets

Question

I have a question. As we know, ML.NET is an amazing framework for .NET. It does a lot of things by itself, so it is sometimes hard to see what is going on inside.

I have a dataset with 30 different features. I'm afraid of overfitting, so I'm looking for the easiest way to remove the unnecessary ones.

For example, if I want to skip the first column, can my Data.cs look like this?

    //skipped Column 0

    [Column(ordinal: "1")]
    public float RadiusMean;

    [Column(ordinal: "2")]
    public float TextureMean;

    [Column(ordinal: "3")]
    public float PerimeterMean;

I noticed that we can also do this by leaving columns out of the features:

    pipeline.Add(new ColumnConcatenator(outputColumn: "Features",
                "TextureMean",
                "PerimeterMean",
                "AreaMean",
                // leave out the unnecessary columns

Doing this can improve the result. But does it really work like "deleting" the columns as far as training the model is concerned?

Second question: is there a faster way to define the columns? Or is there a method in ML.NET for getting the columns from the dataset?

score 2 · Accepted Answer · answered Sep 16 '18 at 18:26

First question: removing the column from the input class in Data.cs means that the TextLoader will skip the column when reading in the file. This is probably the best option if you don't want to use it at all.

If you don't include the column in the "Features" column, it won't be included in training. The learner looks at the "Features" and "Label" columns by default, so other columns would not be used. However, you are still paying the cost of reading in the column. This might be useful if you want to use the column for feature engineering but not in training.
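To make the two options concrete, here is a minimal sketch. It assumes a later ML.NET release (1.0 or newer), where `LoadColumn` replaced the legacy `Column` attribute and `MLContext` replaced `LearningPipeline`, so it is not the API used in the question. The class name `TumorData`, the file path `data.csv`, and the comma separator with a header row are all hypothetical.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input class: column 0 has no field here,
// so the loader never reads it from the file.
public class TumorData
{
    [LoadColumn(1)]
    public float RadiusMean;

    [LoadColumn(2)]
    public float TextureMean;

    [LoadColumn(3)]
    public float PerimeterMean;
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // "data.csv" is a placeholder path; separator and header are assumptions.
        IDataView data = mlContext.Data.LoadFromTextFile<TumorData>(
            "data.csv", separatorChar: ',', hasHeader: true);

        // List the columns that were actually loaded.
        foreach (var column in data.Schema)
            Console.WriteLine(column.Name);

        // RadiusMean is still read from the file, but it is left out of
        // "Features", so a trainer reading "Features" would not use it.
        var pipeline = mlContext.Transforms.Concatenate(
            "Features",
            nameof(TumorData.TextureMean),
            nameof(TumorData.PerimeterMean));
    }
}
```

The first option (no field for the column) avoids the read cost entirely; the second (omitting it from `Concatenate`) keeps the column available for other transforms.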

Second question: multiple columns can be read in as shown here. This reads in 784 numeric features into one column.

The new APIs will make it easier to read in many columns, as shown here. This reads in 10 columns into one "features" vector column.
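As a rough illustration of reading many columns into one vector column, the sketch below uses the `LoadColumn(start, end)` range form from later ML.NET releases (1.0 or newer). The class name and the layout (30 numeric features in file columns 1 through 30) are assumptions made for this example, not details from the question's dataset.

```csharp
using Microsoft.ML.Data;

// Hypothetical layout: file columns 1..30 (inclusive) hold the 30 numeric features.
public class TumorVectorData
{
    // Reads the whole range into a single fixed-size vector column,
    // so no per-feature field and no separate concatenation step is needed.
    [LoadColumn(1, 30)]
    [VectorType(30)]
    public float[] Features;
}
```

With this shape, dropping a feature means changing the range or loading two ranges into separate fields, rather than editing a long list of individual fields.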

With the new APIs, model introspection will be easier so you can see which features are significant to help you decide which ones to include.

Note: I am on the ML.NET team.

So the "name" in a column definition doesn't need to be unique for every column? — michasaucer, Sep 16 '18 at 18:35
Which column definition are you referring to? Columns can be vectors (e.g. the examples I shared where multiple columns from the dataset are read in as a single column in ML.NET). You should not use the name to try and concatenate columns, but rather the `TextLoader` or `ColumnConcatenator` functionality. — Gal Oshri, Sep 16 '18 at 18:40
