I am using a regression model to predict numeric values from a set of 120 attributes. Seven of these attributes are categorical, and the largest one has about 90,000 unique values. I am training with approximately 1 million rows of data.
However, when I look at the categorical attributes in the datasource summary, they show a maximum of 5000 unique values. Is this a limit that AWS Machine Learning enforces (which could be affecting the accuracy of my model), or is it just a limitation of the summary display?
Also, in the "Most frequent categories" results I have highlighted, blank is shown as the most common value (this could be because my CSV quotes fields, making the empty string a valid value). Does AWS ML ignore blank entries for categorical attributes? Or should I populate missing categorical values with UUIDs/random strings so that a common shared 'blank' value doesn't skew predictions?
I understand that some ML models reserve a spare input ("neuron") for categorical values that were not seen during training but appear at prediction time. Is this the case with AWS Machine Learning?
I am an ML novice, so sorry if my questions are stupid or my methods/assumptions are wrong. I did scan the AWS documentation before asking.
Thanks.