Best practices for handling non-decimal variables. [ACM KDD 2009 CUP]

Question

For practice, I decided to use a neural network to solve the two-class classification problem posed by the ACM Special Interest Group on Knowledge Discovery and Data Mining for the 2009 KDD Cup. The problem I have run into is that the data set contains a lot of "empty" variables, and I am not sure how to handle them. This raises a second question: how should I handle other non-decimal values, such as strings? What are your best practices?

score 1 · Accepted Answer · answered Oct 11 '12 at 11:10

Most approaches require numerical features, so the categorical ones have to be converted into counts. For example, if a certain string is present among the attributes of an instance, its count is 1, otherwise 0. If it occurs more than once, its count increases correspondingly. From this point of view, any feature that is not present (or "empty", as you put it) has a count of 0. Note that the attribute names have to be unique.
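A minimal sketch of the count conversion described above, in plain Python. The function name `count_encode` and the sample attribute strings are hypothetical and not taken from the KDD Cup data; each string is prefixed with its column name so that attribute names stay unique.

```python
from collections import Counter


def count_encode(instances):
    """Convert lists of string attributes into count vectors.

    Returns the sorted vocabulary and one row of counts per instance.
    """
    # Every distinct attribute string becomes one numerical feature.
    vocabulary = sorted({value for inst in instances for value in inst})
    rows = []
    for inst in instances:
        counts = Counter(inst)  # attributes that are absent count as 0
        rows.append([counts[value] for value in vocabulary])
    return vocabulary, rows


# Hypothetical instances; the last one is completely "empty".
instances = [
    ["color=red", "tag=a", "tag=a"],
    ["color=blue"],
    [],
]
vocabulary, rows = count_encode(instances)
print(vocabulary)  # ['color=blue', 'color=red', 'tag=a']
print(rows)        # [[0, 1, 2], [1, 0, 0], [0, 0, 0]]
```

The empty instance simply becomes a row of zeros, which is the sense in which a missing categorical feature has a count of 0.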

Qnan


Yes, that might be useful for categorical features, but what about features that are numerical and have "empty" values? – d3r0n Oct 11 '12 at 18:28
This depends on what the feature corresponds to, but in many cases an "empty" value just means 0. – Qnan Oct 11 '12 at 22:21
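As a rough sketch of the "empty just means 0" suggestion, assuming the numerical columns arrive as raw text cells, one could parse them like this. The helper `parse_numeric` and the sample column are hypothetical, and whether 0 is the right substitute depends on what the feature represents, as noted in the comment above.

```python
def parse_numeric(raw, empty_value=0.0):
    """Parse one raw text cell; empty or whitespace-only cells become empty_value."""
    text = raw.strip()
    return float(text) if text else empty_value


# Hypothetical raw column with two "empty" cells.
column = ["1.5", "", "42", "  "]
values = [parse_numeric(cell) for cell in column]
print(values)  # [1.5, 0.0, 42.0, 0.0]
```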
