2

I'm using WEKA tool for clustering data analysis, however in some of my attributes, there are many values within the domain. Specifically, I need to represent some information about proteins and the information that I need to include is the terms associated with their functions.

For example these values are include on the same attribute "Function":

"RNA-Binding protein", "RNA bindingstructural constituent of ribosomerRNA binding", "translation", "intracellularribosomeribonucleoprotein complex".

And these terms diversify hugely.

Can someone help me?

1 Answers1

2

A common approach is to split categorical variables with n different categories into n binary dummy variables.

For example:

gender = {male, female} can be rewritten with 2 dummy variables as:

  1. male = [0, 1]
  2. female = [1, 0]

In your case, it seems a function can contain several distinct values (e.g. 1 protein with several functions). This is easy to mold into dummy variables too.

Marc Claesen
  • 16,778
  • 6
  • 27
  • 62