I have a question about training an ML.NET model that predicts whether a name is female or not. The model can be trained with a pipeline like this:
var mlContext = new MLContext();
IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(trainingData);
var dataPrepPipeline = mlContext
    .Transforms
    .Text
    .FeaturizeText("FirstNameFeaturized", "FirstName")
    .Append(mlContext.Transforms.Text.FeaturizeText("MiddleNameFeaturized", "MiddleName"))
    .Append(mlContext.Transforms.Text.FeaturizeText("LastNameFeaturized", "LastName"))
    .Append(mlContext.Transforms.Concatenate(
        "Features",
        "FirstNameFeaturized",
        "MiddleNameFeaturized",
        "LastNameFeaturized"))
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    .AppendCacheCheckpoint(mlContext);
var prepPipeline = dataPrepPipeline.Fit(trainingDataView);
var preprocessedData = prepPipeline.Transform(trainingDataView);
var trainer = dataPrepPipeline.Append(mlContext
    .BinaryClassification
    .Trainers
    .AveragedPerceptron(labelColumnName: "IsFemale", numberOfIterations: 10, featureColumnName: "Features"));
ITransformer trainedModel = trainer.Fit(preprocessedData);
I have left out trainingData from the code. The Person model class looks like this:
public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}
I then fetch a list of persons from somewhere (database, CSV, whatever) and convert them to Person objects.
As part of that conversion, I use code like this:
var trainingData = new List<Person>();
trainingData.AddRange(persons.Select(p => new Person
{
    IsFemale = p.IsFemale,
    FirstName = p.FirstName ?? "unknown",
    MiddleName = p.MiddleName ?? "unknown",
    LastName = p.LastName ?? "unknown"
}));
You might be wondering why I insert unknown when one of the name parts is null. I do this because building the ML.NET pipeline fails if any of the properties is null.
So here's my question. I suspect that setting missing name parts to unknown produces a poor model. Example: if I have a male person with the first name Thomas and none of the other parts, that produces Thomas unknown unknown. Wouldn't that increase the probability that other persons with a missing middle and last name are classified as not female? Let's say we have a person named Anna and we don't have the remaining parts. This produces Anna unknown unknown, which is close to the other one already marked as non-female.
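To make the concern concrete, here is a minimal sketch in plain C# (it does not use ML.NET). It assumes a deliberately simplified featurization with one column-tagged token per name part, which is much cruder than whatever FeaturizeText actually computes. PlaceholderOverlap, Tokens and Jaccard are hypothetical names made up for this illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlaceholderOverlap
{
    // Hypothetical helper: builds one column-tagged token per name part,
    // substituting "unknown" for null exactly as in the conversion code above.
    public static HashSet<string> Tokens(string first, string middle, string last)
    {
        return new HashSet<string>
        {
            "first:" + (first ?? "unknown").ToLowerInvariant(),
            "middle:" + (middle ?? "unknown").ToLowerInvariant(),
            "last:" + (last ?? "unknown").ToLowerInvariant()
        };
    }

    // Jaccard similarity: shared tokens divided by distinct tokens overall.
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        int shared = a.Intersect(b).Count();
        int total = a.Union(b).Count();
        return total == 0 ? 0.0 : (double)shared / total;
    }

    public static void Main()
    {
        var thomas = Tokens("Thomas", null, null);
        var anna = Tokens("Anna", null, null);

        // The two rows differ only in the first-name token;
        // both placeholder tokens are shared.
        Console.WriteLine(Jaccard(thomas, anna));
    }
}
```

Under this toy representation, Thomas unknown unknown and Anna unknown unknown share two of their four distinct tokens (the two placeholders), which is the kind of overlap I am worried about. Whether the real featurizer and trainer behave the same way is exactly what I am unsure of.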