
I am trying to write a Java program that calls the CfsSubsetEval class in Weka to perform feature subset selection. CfsSubsetEval discretizes the dataset, and I am trying to avoid that because the dataset is already discretized. The following lines from CfsSubsetEval.java perform the discretization:

m_isNumeric = m_trainInstances.attribute(m_classIndex).isNumeric();

if (!m_isNumeric)
{
    m_disTransform = new Discretize();
    m_disTransform.setUseBetterEncoding(true);
    m_disTransform.setInputFormat(m_trainInstances);
    m_trainInstances = Filter.useFilter(m_trainInstances, m_disTransform);
}

Since the class attribute is defined in the arff file as follows:

@ATTRIBUTE class {true,false}

the attribute is not numeric, and hence the discretization is performed.

Although I have little knowledge of Weka's implementation, I tried commenting out these lines to skip the discretization. However, it did not work, and the following exception was reported:

java.lang.ArrayIndexOutOfBoundsException: 1
at weka.attributeSelection.CfsSubsetEval.symmUncertCorr(CfsSubsetEval.java:515)
at weka.attributeSelection.CfsSubsetEval.correlate(CfsSubsetEval.java:445)
at weka.attributeSelection.CfsSubsetEval.evaluateSubset(CfsSubsetEval.java:392)
at weka.attributeSelection.BestFirst.search(BestFirst.java:806)
at weka.attributeSelection.AttributeSelection.SelectAttributes(AttributeSelection.java:606)
at selecting_features.runFeatureSelection.main(runFeatureSelection.java:39)

The question is: how can I change CfsSubsetEval.java so it does not discretise the dataset?

Your help is deeply appreciated.

user52732

1 Answer


Symmetrical uncertainty is an entropy-based measure that works on nominal attributes. weka.filters.supervised.attribute.Discretize will not alter any nominal attributes. You say that your input attributes are already discretized; are they actually integer-valued attributes declared as Weka type numeric? If so, you should preprocess the data with weka.filters.unsupervised.attribute.NumericToNominal. This gives you nominal attributes whose labels correspond to the distinct values of each attribute in the data. After that, the discretization step in CFS will leave your attributes untouched.
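To illustrate why the measure needs nominal attributes, here is a minimal plain-Java sketch of symmetrical uncertainty, SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), computed from counts of label co-occurrences. This is not Weka's code: the class and method names are hypothetical, labels are represented as strings, and returning 0 when both attributes are constant is an assumed convention.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not taken from CfsSubsetEval.
class SymmetricalUncertaintySketch {

    // Shannon entropy in bits of a set of occurrence counts that sum to total.
    static double entropy(Iterable<Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Symmetrical uncertainty between two nominal attributes given as parallel label arrays.
    static double symmetricalUncertainty(String[] x, String[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        Map<String, Integer> countX = new HashMap<>();
        Map<String, Integer> countY = new HashMap<>();
        Map<List<String>, Integer> countXY = new HashMap<>();
        for (int i = 0; i < x.length; i++) {
            countX.merge(x[i], 1, Integer::sum);
            countY.merge(y[i], 1, Integer::sum);
            // The joint count is keyed on the pair of labels.
            countXY.merge(Arrays.asList(x[i], y[i]), 1, Integer::sum);
        }
        int n = x.length;
        double hX = entropy(countX.values(), n);
        double hY = entropy(countY.values(), n);
        double hXY = entropy(countXY.values(), n);
        double denominator = hX + hY;
        if (denominator == 0.0) {
            return 0.0; // assumed convention: both attributes are constant
        }
        return 2.0 * (hX + hY - hXY) / denominator;
    }

    public static void main(String[] args) {
        String[] cls = {"true", "true", "false", "false"};
        String[] predictive = {"low", "low", "high", "high"};   // mirrors the class exactly
        String[] independent = {"low", "high", "low", "high"};  // carries no class information
        System.out.println(symmetricalUncertainty(predictive, cls));
        System.out.println(symmetricalUncertainty(independent, cls));
    }
}
```

The counting step only makes sense when each attribute has a finite set of labels, which is why attributes declared as numeric have to be converted to nominal before the measure is applied.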

Cheers, Mark.