
I'm using Weka for clustering binary data. Note that I use Weka directly through the API or the source code.

My data input is a huge .csv file, for example:

    attrib1, attrib2, atrib3
    0,1,0
    1,0,1
    0,0,1

But in order to reduce the .csv size, the data provider (I don't have direct access to the dataset) omits zeros, so the above snippet is written as:

    attrib1, attrib2, atrib3
    ,1,
    1,,1
    ,,1

So I figured out that Weka treats the value between two commas as a "missing value" (that's the term used in the code base), which is not what I want.

I've been trying to work it out directly through the source code.

In particular, CSVLoader.getDataSet() and CSVLoader.getInstance(), along with ConverterUtils.getToken(), seem to be responsible for this.

I've tried a lot to change the code and make Weka treat these null values (because that's what Weka thinks they are) as zeros, but I can't find a solution.

Can someone provide a better solution?

Flo

1 Answer


Have you considered using the ARFF format?

A key benefit of the ARFF format is that it has a sparse variant, in which unstated values are treated as 0 rather than missing.
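For instance, the three data rows from the question could be written in sparse ARFF like this (each entry is a zero-based attribute index followed by its value; attributes not listed default to 0, not to missing):

    @relation binary_example
    @attribute attrib1 numeric
    @attribute attrib2 numeric
    @attribute attrib3 numeric
    @data
    {1 1}
    {0 1, 2 1}
    {2 1}

This reproduces the dense rows 0,1,0 / 1,0,1 / 0,0,1 exactly, and is typically even smaller than the comma-elision trick your data provider uses.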

Furthermore, I guess you could add a custom parser somehow. So have you considered just modifying the CSV parser for your personal CSV variant? It shouldn't be too hard to do.
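Instead of patching CSVLoader internals, it may be simpler to preprocess the file before handing it to Weka. A minimal sketch (the class and method names are my own invention, and it assumes fields never contain quoted commas):

```java
public class SparseCsvFiller {

    // Replace every empty field in a CSV line with "0".
    // split(",", -1) keeps trailing empty fields, so ",1," yields ["", "1", ""].
    public static String fillZeros(String line) {
        String[] fields = line.split(",", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            String f = fields[i].trim();
            sb.append(f.isEmpty() ? "0" : f);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Demo on the rows from the question; the header line passes through unchanged
        // (apart from trimmed spaces) because none of its fields are empty.
        System.out.println(fillZeros(",1,"));   // 0,1,0
        System.out.println(fillZeros("1,,1"));  // 1,0,1
        System.out.println(fillZeros(",,1"));   // 0,0,1
    }
}
```

You can run each line of the file through such a filter (or a one-off script) and then load the repaired CSV with the stock CSVLoader, keeping your Weka build unmodified.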

Some algorithms (e.g., APRIORI) also have parameters that allow treating missing values as 0.

Has QUIT--Anony-Mousse