Numerically representing Nominal Data whilst retaining data semantics

Question

I have a dataset of nominal and numerical features. I want to be able to represent this dataset entirely numerically if possible.

Ideally I would be able to do this for an n-ary nominal feature. I realize that in the binary case, one could represent the two nominal values with integers. However, when a nominal feature can take many distinct values, how would this be possible, if at all?

score 2 · Accepted Answer · edited May 12 '17 at 19:51

There are a number of techniques to "embed" categorical attributes as numbers.

For example, given a categorical variable that can take the values red, green and blue, we can trivially encode this as three attributes isRed={0,1}, isGreen={0,1} and isBlue={0,1}.

While this is popular, and will obviously "work", many people fall for the fallacy of assuming that numerical processing techniques will then produce sensible results on the encoded data.

If you run e.g. k-means on a dataset encoded this way, the result will likely not be very meaningful. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5, you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0, which corresponds to no color at all.
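To make the problem concrete, here is a minimal sketch of the centroid k-means would compute for one cluster of one-hot encoded colors. The ten-item toy dataset is made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy cluster: 3 red, 2 green and 5 blue items.
colors = ["red"] * 3 + ["green"] * 2 + ["blue"] * 5
categories = ["red", "green", "blue"]

# One-hot encode as columns (isRed, isGreen, isBlue).
one_hot = np.array([[int(c == cat) for cat in categories] for c in colors])

# The cluster mean, as k-means would compute it.
centroid = one_hot.mean(axis=0)
print(centroid)

# Check whether the centroid equals any valid one-hot vector (it does not).
is_valid_color = any(np.array_equal(centroid, row) for row in np.eye(3))
print(is_valid_color)
```

The centroid is a vector of category proportions, not a color, so it cannot be decoded back into a single nominal value.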

I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited, and your data will not satisfy all the mathematical assumptions (e.g. those of a metric space) that you need to benefit from this view.
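The answer does not name a specific method for mixed data, so the following is only one possible illustration: a Gower-style dissimilarity that compares nominal features by simple mismatch and numeric features by range-scaled difference, without encoding anything. The function name mixed_distance, the feature names and the assumed value range are all hypothetical.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity between two records given as dicts.

    Features listed in numeric_ranges are numeric and are compared by
    absolute difference divided by their (assumed) range. All other
    features are treated as nominal: 0 if equal, 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical records with one nominal and one numeric feature.
x = {"color": "red", "weight": 2.0}
y = {"color": "blue", "weight": 4.0}
z = {"color": "red", "weight": 4.0}
ranges = {"weight": 10.0}  # assumed spread of weight in the dataset

d_xy = mixed_distance(x, y, ranges)  # different color, different weight
d_xz = mixed_distance(x, z, ranges)  # same color, different weight
print(d_xy, d_xz)
```

A dissimilarity like this can be fed to algorithms that only need pairwise distances (for example hierarchical clustering), so no mean of nominal values is ever computed.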

score 1 · Answer 2 · answered Nov 29 '13 at 00:44

Don't do this: don't encode nominal attributes as integers.

The exception is a nominal feature with only two possible values. There it is fine to use any two distinct integers (for example 1 and 3).

With more than two values, however, integers should not be used. Say we assign 1, 2 and 3 to three values. The differences then imply a closer relation between 1 and 2, and between 2 and 3, than between 1 and 3, even though the original values have no such order.

Rather, use a separate binary feature for each value of each nominal attribute. So, to answer your question: encoding an n-ary nominal feature as a single integer is not possible, or at least not wise.
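As a small sketch of this argument, the snippet below compares distances under an integer coding with distances under one binary feature per value. The color names and the codes 1, 2, 3 are arbitrary choices for illustration, and NumPy is assumed.

```python
import numpy as np

# Hypothetical integer codes for three unordered categories.
int_code = {"red": 1, "green": 2, "blue": 3}

# Integer coding: red looks closer to green than to blue.
d_red_green = abs(int_code["red"] - int_code["green"])
d_red_blue = abs(int_code["red"] - int_code["blue"])
print(d_red_green, d_red_blue)

# One binary feature per value: each category becomes a row of the identity.
one_hot = {name: row for name, row in zip(int_code, np.eye(3))}

# Euclidean distance between any two distinct categories is now the same.
oh_red_green = np.linalg.norm(one_hot["red"] - one_hot["green"])
oh_red_blue = np.linalg.norm(one_hot["red"] - one_hot["blue"])
print(oh_red_green, oh_red_blue)
```

With the binary features, no pair of categories is artificially closer than any other.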

score 0 · Answer 3 · neox · 2017-05-12T21:10:44.733

If you use pandas, you can call pd.get_dummies() on your nominal column. This turns a column with N unique values into N new indicator columns (or N-1 if you pass drop_first=True), each holding a 1 or a 0 to show whether that value is present.

Example:

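import pandas as pd
s = pd.Series(list('abca'))

# dtype=int gives 0/1; recent pandas versions default to True/False
pd.get_dummies(s, dtype=int)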
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
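For the N-1 variant mentioned above, a minimal sketch using the same series follows. The variable names full and reduced are only illustrative.

```python
import pandas as pd

s = pd.Series(list("abca"))

# Full encoding: one indicator column per unique value (a, b, c).
full = pd.get_dummies(s, dtype=int)

# drop_first=True drops the first category ("a"), leaving N-1 columns.
# A row of all zeros then stands for the dropped category.
reduced = pd.get_dummies(s, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced.iloc[0].tolist())
```

Dropping one column removes the redundancy between the indicators, which matters for models such as linear regression, at the cost of making one category implicit.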

edited May 12 '17 at 21:10

answered May 12 '17 at 21:05
