Using placeholder on empty string when training model with ML.NET

Question

I have a question about training an ML.NET model that predicts whether a name is female or not. The model can be trained with a pipeline like this:

var mlContext = new MLContext();
IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(trainingData);
var dataPrepPipeline = mlContext
    .Transforms
    .Text
    .FeaturizeText("FirstNameFeaturized", "FirstName")
    .Append(mlContext.Transforms.Text.FeaturizeText("MiddleNameFeaturized", "MiddleName"))
    .Append(mlContext.Transforms.Text.FeaturizeText("LastNameFeaturized", "LastName"))
    .Append(mlContext.Transforms.Concatenate(
        "Features",
        "FirstNameFeaturized",
        "MiddleNameFeaturized",
        "LastNameFeaturized"))
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    .AppendCacheCheckpoint(mlContext);

var prepPipeline = dataPrepPipeline.Fit(trainingDataView);
var preprocessedData = prepPipeline.Transform(trainingDataView);

var trainer = dataPrepPipeline.Append(mlContext
    .BinaryClassification
    .Trainers
    .AveragedPerceptron(labelColumnName: "IsFemale", numberOfIterations: 10, featureColumnName: "Features"));

ITransformer trainedModel = trainer.Fit(preprocessedData);

I have left out trainingData from the code. The input data class looks like this:

public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}

I then fetch a list of persons from somewhere (database, csv, whatever) and convert it to Person objects.

As part of converting the persons to Person I'm using code looking like this:

var trainingData = new List<Person>();
trainingData.AddRange(persons.Select(p => new Person
{
    IsFemale = p.IsFemale,
    FirstName = p.FirstName ?? "unknown",
    MiddleName = p.MiddleName ?? "unknown",
    LastName = p.LastName ?? "unknown"
}));

You might be wondering why I insert "unknown" when one of the name parts is null. I do this because building the ML.NET pipeline fails if any of the properties is null.

So here's my question. I suspect that setting name parts to "unknown" produces a poor model. Example: if I have a male person with first name Thomas and I don't have the other parts, that produces "Thomas unknown unknown". Wouldn't that increase the probability of other persons being classified as not female when their middle and last names are missing? Let's say we have a person named Anna and we don't have the remaining parts. This produces "Anna unknown unknown", which is close to the other one already marked as non-female.

score 1 · Answer 1 · answered Jun 20 '22 at 03:59

Of course it will! You are introducing artificial data into the set, which causes most machine learning algorithms to become less accurate.

There are techniques for handling missing data, but the features in this example are not numerical. The most reasonable way to handle the missing values here is to leave the rows that lack them out of the training set entirely.
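A minimal sketch of dropping incomplete rows before training, assuming the Person class from the question. TrainingDataFilter and KeepCompleteRows are hypothetical names introduced only for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the Person class in the question.
public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}

// Hypothetical helper, not part of ML.NET.
public static class TrainingDataFilter
{
    // Keeps only the persons for whom all three name parts are present.
    public static List<Person> KeepCompleteRows(IEnumerable<Person> persons) =>
        persons
            .Where(p => !string.IsNullOrWhiteSpace(p.FirstName)
                     && !string.IsNullOrWhiteSpace(p.MiddleName)
                     && !string.IsNullOrWhiteSpace(p.LastName))
            .ToList();
}
```

The filtered list can then be passed to mlContext.Data.LoadFromEnumerable as in the question. Keep in mind that this throws away every person with only a first name, so it is only practical if enough complete rows remain.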

If these were numerical features of a person, such as weight or height, you could compute the mean or mode across the entire data set and substitute that value wherever the feature is missing.
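For the numerical case, mean imputation could look roughly like this in plain C#. This is only a sketch: MeanImputation and FillWithMean are hypothetical names, and missing values are assumed to be represented as null in a nullable float list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
public static class MeanImputation
{
    // Replaces each missing (null) value with the mean of the observed values.
    public static float[] FillWithMean(IReadOnlyList<float?> values)
    {
        var observed = values.Where(v => v.HasValue).Select(v => v.Value).ToList();
        if (observed.Count == 0)
            throw new ArgumentException("At least one observed value is required.", nameof(values));

        float mean = observed.Average();
        return values.Select(v => v ?? mean).ToArray();
    }
}
```

For example, weights of 60, missing, and 80 become 60, 70, and 80. ML.NET also ships a ReplaceMissingValues transform with a Mean replacement mode that does this inside a pipeline for numeric columns.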

daniel_sweetser · Answer 2 · 2020-12-16T11:04:49.067

Using Microsoft.ML.AutoML 0.17.2 on .NET Core 3.1, I ran a binary classification experiment against a dataset that contains nulls. I get no errors and a reasonable result if I scrub the null values and replace them with any string, including an empty string. My current pipeline featurizes all the text columns in one go; I'm not sure whether that makes a difference compared with what you are doing:

var options = new TextFeaturizingEstimator.Options();
options.KeepNumbers = true;
options.WordFeatureExtractor = null;
options.CharFeatureExtractor = null;
...
var initializer = mlContext.Transforms.Conversion.ConvertType("Label", "Column1", Microsoft.ML.Data.DataKind.Boolean)
    .Append(mlContext.Transforms.Text.FeaturizeText("Features", options, propertyNames));
var initializedData = initializer.Fit(trainDataView).Transform(trainDataView);

But the key thing is that ML.NET doesn't appear to care WHAT you have, as long as it isn't null. I tried a number of filler values, such as "?", " ", "" and "_", and I got the most reasonable result with " ". Hope this makes sense and helps with your problem.
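Applied to the Person class from the question, the scrubbing step might look like the sketch below. NullScrubber and ReplaceNulls are hypothetical names, and the single-space filler is just the value that worked best in my experiment, not a general recommendation:

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the Person class in the question.
public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}

// Hypothetical helper, not part of ML.NET.
public static class NullScrubber
{
    // Assumption: a single space, the filler that gave the most reasonable result above.
    public const string Filler = " ";

    // Returns copies of the persons with every null name part replaced by the filler.
    public static List<Person> ReplaceNulls(IEnumerable<Person> persons) =>
        persons.Select(p => new Person
        {
            IsFemale = p.IsFemale,
            FirstName = p.FirstName ?? Filler,
            MiddleName = p.MiddleName ?? Filler,
            LastName = p.LastName ?? Filler
        }).ToList();
}
```

Changing the Filler constant makes it easy to compare other filler values against the same data.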
