In Weka, how can I stop CfsSubsetEval from discretizing training instances?

Question

I am trying to write a Java program that calls the CfsSubsetEval class in Weka to perform feature subset selection. CfsSubsetEval discretizes the dataset, and I am trying to avoid that because the dataset is already discretized. The following lines from CfsSubsetEval.java perform the discretization:

m_isNumeric = m_trainInstances.attribute(m_classIndex).isNumeric();

if (!m_isNumeric)
{
    m_disTransform = new Discretize();
    m_disTransform.setUseBetterEncoding(true);
    m_disTransform.setInputFormat(m_trainInstances);
    m_trainInstances = Filter.useFilter(m_trainInstances, m_disTransform);
}

Since the class attribute is defined in the arff file as follows:

@ATTRIBUTE class {true,false}

the attribute is not numeric, and hence the discretization is performed.

Although I have little knowledge of Weka's implementation, I tried commenting out these lines to skip the discretization. However, that did not work, and the following exception was reported:

java.lang.ArrayIndexOutOfBoundsException: 1
at weka.attributeSelection.CfsSubsetEval.symmUncertCorr(CfsSubsetEval.java:515)
at weka.attributeSelection.CfsSubsetEval.correlate(CfsSubsetEval.java:445)
at weka.attributeSelection.CfsSubsetEval.evaluateSubset(CfsSubsetEval.java:392)
at weka.attributeSelection.BestFirst.search(BestFirst.java:806)
at weka.attributeSelection.AttributeSelection.SelectAttributes(AttributeSelection.java:606)
at selecting_features.runFeatureSelection.main(runFeatureSelection.java:39)

The question is: how can I change CfsSubsetEval.java so that it does not discretize the dataset?

Your help is deeply appreciated.

score 2 · Accepted Answer · answered Nov 05 '14 at 08:30

Symmetrical uncertainty is an entropy-based measure that works on nominal attributes. weka.filters.supervised.attribute.Discretize will not alter any nominal attributes. You say that your input attributes are already discretized. Are they actually integer-valued attributes coded as Weka type numeric? If so, then you should preprocess the data using weka.filters.unsupervised.attribute.NumericToNominal. This will give you a nominal attribute with a list of labels that correspond to the distinct values for that attribute in the data. After doing this, the discretization process in CFS will leave your attributes untouched.
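
To illustrate why symmetrical uncertainty needs nominal inputs, here is a minimal plain-Java sketch of the measure, SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)). It is not Weka's code: the class name SymmetricalUncertainty and the method compute are hypothetical, and the sketch assumes each attribute value has already been mapped to a label index starting at 0.

```java
public class SymmetricalUncertainty {

    /** Shannon entropy in bits of a vector of counts that sum to total. */
    public static double entropy(double[] counts, double total) {
        double h = 0.0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * Symmetrical uncertainty between two nominal attributes.
     * x[i] and y[i] are label indices in [0, numX) and [0, numY).
     * (Hypothetical helper for illustration, not part of Weka.)
     */
    public static double compute(int[] x, int numX, int[] y, int numY) {
        double[] countX = new double[numX];
        double[] countY = new double[numY];
        double[] countXY = new double[numX * numY];
        // Counts are indexed by label, so every value must be a valid label index.
        for (int i = 0; i < x.length; i++) {
            countX[x[i]]++;
            countY[y[i]]++;
            countXY[x[i] * numY + y[i]]++;
        }
        double n = x.length;
        double hx = entropy(countX, n);
        double hy = entropy(countY, n);
        double hxy = entropy(countXY, n);
        if (hx + hy == 0.0) {
            return 0.0; // both attributes are constant; treat as uncorrelated
        }
        return 2.0 * (hx + hy - hxy) / (hx + hy);
    }

    public static void main(String[] args) {
        int[] a = {0, 0, 1, 1};
        int[] identical = {0, 0, 1, 1};
        int[] independent = {0, 1, 0, 1};
        System.out.println("identical:   " + compute(a, 2, identical, 2));
        System.out.println("independent: " + compute(a, 2, independent, 2));
    }
}
```

Because the counts are indexed by label, an attribute with no label set (a numeric one) has nothing to index with, which is why the data should be nominal before the measure is applied.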

Cheers, Mark.
