There are a number of techniques to "embed" categorical attributes as numbers.
For example, given a categorical variable that can take the values red
, green
and blue
, we can trivially encode this as three attributes isRed={0,1}
, isGreen={0,1}
and isBlue={0,1}
.
While this is popular, and will obviously "work", many people fall for the fallacy of assuming that afterwards numerical processing techniques will produce sensible results.
If you run e.g. k-means on a dataset encoded this way, the result will likely not be too meaningful afterwards. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5
- you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0
.
I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited and the data will not give you all the mathematical assumptions that you need to benefit from this view (e.g. metric spaces).