
I am using a regression model to predict numeric values from a set of 120 attributes. Seven of these attributes are categorical, and the largest of them has about 90,000 unique values. I am training with approximately 1 million rows of data.

However, when I look at the Categorical attributes in the datasource summary I can see that these show a maximum of 5000 unique values. Is this some kind of limit that AWS Machine Learning is enforcing which is affecting the accuracy of my model, or is it just a limitation of the summary display?

[Image: AWS Categorical Attribute Summary]

Also, I have highlighted the "Most frequent categories" results, where blank is shown as the most common value. (This could be because my CSV includes quotes, so an empty quoted string counts as a valid value.) Does AWS ML ignore blank entries for categorical attributes? Or should I be populating missing categorical values with UUIDs/random strings so that a common shared 'blank' value doesn't skew predictions?

I understand that some ML models keep a spare neuron around for when new (previously unseen in training) categorical values are entered for predictions. Is this the case with AWS Machine Learning?

I am an ML novice, so sorry if my questions are stupid, or my methods/assumptions are wrong. I did scan the AWS documentation before asking.

Thanks.

Sprooose
  • You're using a large number of attributes, so apparently there was no principled attribute selection and all attributes were fed into learning as-is. Some attributes may be insignificant to learning or may cause reverse learning, and relevant attributes may not have been captured at all. I can see the correlation factors are quite low, around 0.5. Even experts miss this aspect when using large data. Use PCA to improve the network. And there is no spare neuron: all input data converges to the outputs trained. – SACn Mar 10 '17 at 10:34

1 Answer


It usually doesn't make much sense to use so many category values. Only the top (most frequent) values will be used, as the other, smaller categories don't have much predictive power.
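To make the "only the top values" idea concrete, here is a minimal sketch of capping a categorical column yourself before training. It is not a description of what AWS ML does internally; the function names and the `__OTHER__` label are hypothetical, chosen only for illustration.

```python
from collections import Counter

OTHER = "__OTHER__"  # hypothetical label for the shared "everything else" bucket


def fit_top_categories(values, max_categories):
    """Return the set of the max_categories most frequent values."""
    counts = Counter(values)
    return {value for value, _ in counts.most_common(max_categories)}


def apply_top_categories(values, keep):
    """Map every value outside `keep` to the shared OTHER bucket."""
    return [v if v in keep else OTHER for v in values]


# Toy training column with a tail of rare categories.
train = ["a", "a", "a", "b", "b", "c", "d"]
keep = fit_top_categories(train, max_categories=2)

capped_train = apply_top_categories(train, keep)
# A value never seen in training ("z") also lands in the OTHER bucket.
capped_new = apply_top_categories(["a", "z"], keep)

print(capped_train)
print(capped_new)
```

Fitting the kept set on the training data and reusing it at prediction time means rare and previously unseen values share one bucket, which is one common way to handle them under these assumptions.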

These categories have a very high correlation to the target, which is a bit suspicious. But if the model is working well with them, I wouldn't be too worried. You can try building the model without them to see if it makes any difference, but I wouldn't work too hard on selecting features; I would spend more effort on adding more potentially useful ones.

Guy
  • Thanks. Yes, I have wanted to replace the categorical attributes with numeric attributes that uniquely "describe" those categories. I will compare the two outputs. I just wanted to know what limitations I was up against with AWS. Do you know if AWS ML has a hard limit of using the most "useful" 5000 categorical values, or are you just talking about how machine learning works in general with many categorical values? – Sprooose Mar 20 '17 at 21:44