It depends on how you process the categorical data.
If, for example, you use the dictionary-based one-hot vectorizer:
new CategoricalOneHotVectorizer("Column1", "Column2", "Column3")
then the model will build a dictionary of terms per column:
Column1 -> [A, D, G]
Column2 -> [B, E, H]
Column3 -> [C, F, I]
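If a value has not been seen (it is not present in the column's dictionary), the CategoricalOneHotVectorizer
assigns zero to all of that column's one-hot slots. So your example A B Z
will turn into 1 0 0 1 0 0 0 0 0:
A and B each activate the first slot of their column, and the unseen Z leaves all three Column3 slots at zero.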
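To make the dictionary behaviour concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `one_hot_row` and the `dictionaries` structure are made up for illustration, and only mimic the per-column lookup described above.

```python
# Per-column dictionaries, as they would be built from the training data above.
dictionaries = {
    "Column1": ["A", "D", "G"],
    "Column2": ["B", "E", "H"],
    "Column3": ["C", "F", "I"],
}

def one_hot_row(row, dictionaries):
    """Concatenate one one-hot block per column.

    A value that is missing from its column's dictionary leaves
    that column's block as all zeros (hypothetical helper).
    """
    encoded = []
    for terms, value in zip(dictionaries.values(), row):
        block = [0] * len(terms)
        if value in terms:
            block[terms.index(value)] = 1
        encoded.extend(block)
    return encoded

print(one_hot_row(["A", "B", "Z"], dictionaries))
# prints [1, 0, 0, 1, 0, 0, 0, 0, 0]
```

The last three zeros are the Column3 block: Z matches none of C, F, I, so no slot is set.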
If, on the other hand, you use hash-based one-hot encoding:
new CategoricalHashOneHotVectorizer("Column1", "Column2", "Column3")
the incoming value Z will be hashed in the same way as the seen values C, F and I, and this will activate one of the 2^HashBits
slots of the output column, based on the value of the hash.
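The hashing case can be sketched the same way. This is an illustration under stated assumptions, not the library's code: it uses CRC32 from Python's standard library as a stand-in hash (the actual hash function used by CategoricalHashOneHotVectorizer is not specified here), and `hash_one_hot` and `HASH_BITS` are hypothetical names.

```python
import zlib

def hash_one_hot(value, hash_bits):
    """Map any string, seen or unseen, to one of 2**hash_bits slots.

    CRC32 is only a stand-in hash for this sketch (assumption).
    """
    num_slots = 1 << hash_bits
    slot = zlib.crc32(value.encode("utf-8")) % num_slots
    vector = [0] * num_slots
    vector[slot] = 1
    return vector

HASH_BITS = 3  # hypothetical setting: 2**3 = 8 slots per column

unseen = hash_one_hot("Z", HASH_BITS)
print(len(unseen), sum(unseen))
# prints 8 1
```

Because there is no dictionary, Z always activates exactly one slot. That slot may coincide with the slot of a seen value such as C, F or I, in which case the model cannot tell them apart.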
The documentation for CategoricalOneHotVectorizer is not very clear on this point, but it does say:
The Key value is the one-based index of the slot set in the Ind/Bag options. If the Key option is not found, it is assigned the value zero.