
When I try this method with dense vector data it runs correctly, but with sparse vector data it throws java.lang.ArrayIndexOutOfBoundsException. Which data source can I use to read sparse vector data correctly?

public void runKmeans(double[][] data) {
    ArrayAdapterDatabaseConnection dataArray = new ArrayAdapterDatabaseConnection(data);

    ListParameterization params = new ListParameterization();
    params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dataArray);

    Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
    db.initialize();

    // Parameterization
    params = new ListParameterization();
    params.addParameter(KMeans.K_ID, k);
    params.addParameter(KMeans.SEED_ID, 0);

    // setup Algorithm
    KMeansOutlierDetection<DoubleVector> kmeansAlg = ClassGenericsUtil.parameterizeOrAbort(KMeansOutlierDetection.class, params);
    //testParameterizationOk(params);

    // run KMEANS on database
    OutlierResult result = kmeansAlg.run(db);
...
Wesin Alves
  • If I'm not mistaken, `ArrayAdapterDatabaseConnection` only supports *dense* data. `DoubleVector` also is a *dense* data type. Any chance you misinterpreted the format of `data`? – Has QUIT--Anony-Mousse Jan 19 '16 at 22:00
  • Apart from that, K-means does not make *sense* on sparse data. Whatever you are trying to do - it is the wrong algorithm. – Has QUIT--Anony-Mousse Jan 19 '16 at 22:01
  • I first read the data into an ArrayList, and then used the toArray method to populate the array. I tried k-means because when I use ELKI's GUI, k-means can run on my sparse data in ARFF format. Why can k-means run from ELKI's GUI but not from code? – Wesin Alves Jan 20 '16 at 01:39
  • Of course you can code that, but these classes *don't* do that. A `double[]` is a *dense* format. You can build your own data source, or use the Arff data source, or, or, or, ... but in the end, k-means assumes dense data from a *theoretical* point of view. – Has QUIT--Anony-Mousse Jan 20 '16 at 02:23
  • How would you meaningfully put a *sparse* vector in that ArrayList? – Has QUIT--Anony-Mousse Jan 20 '16 at 10:27
  • I read a sparse ARFF file incrementally, like this: `@data {1 X, 3 Y, 4 "class A"} {2 W, 4 "class B"}`, and call `arrlist.add(data)` for each row. When reading stops, I do `inputMatrix = new double[arrlist.size()][arrlist.get(0).length]; inputMatrix = arrlist.toArray(inputMatrix);`. I do this because my program can't read the ARFF file directly; I need to read it incrementally. – Wesin Alves Jan 20 '16 at 13:35
  • To explain why **k-means does not make sense on sparse data**: k-means uses the *mean*. It assumes a fixed dimensionality d, too. Averaging sparse vectors of different lengths destroys all the nice mathematical support for the algorithm. – Has QUIT--Anony-Mousse Jan 20 '16 at 13:35
  • And how do you *store* that data in a `double[]`? Why don't you put all that code into the question? Also, it looks like categorical data; k-means requires continuous data. – Has QUIT--Anony-Mousse Jan 20 '16 at 13:38
  • It was just an example. My data really is continuous. I declare inputMatrix as double[][]. After storing the data in it, I call `runKmeans(inputMatrix);` – Wesin Alves Jan 20 '16 at 13:46
  • I agree k-means doesn't make sense for sparse data. I just find it strange that ELKI's GUI can run a sparse ARFF file without throwing java.lang.ArrayIndexOutOfBoundsException. – Wesin Alves Jan 20 '16 at 13:55
  • But `double[][]` is *dense*, and you wanted *sparse* data! Whatever you are doing (add the preparation code to your question, please), you are *not* creating sparse vectors. – Has QUIT--Anony-Mousse Jan 20 '16 at 15:11
  • Also, the way you seem to initialize it (with `toArray`) makes it likely non-rectangular (because of ragged arrays). Maybe this is causing the error. Verify that *all* rows have the same length (but again, this is a *dense* format that you are using). – Has QUIT--Anony-Mousse Jan 20 '16 at 15:12
  • The rows don't have the same length. So are you telling me I need to use something like `SparseDoubleVector` from ELKI to store sparse data? – Wesin Alves Jan 20 '16 at 16:27

1 Answer


The class ArrayAdapterDatabaseConnection can only be used for dense vectors. You must supply a rectangular double[][] array, i.e. one in which every row has the same length.
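To check whether an array built with `toArray` is ragged before handing it to ArrayAdapterDatabaseConnection, something like the following could be used. This is a minimal sketch in plain Java; the class `RectangularCheck` and the method `firstRaggedRow` are hypothetical helper names, not part of ELKI.

```java
import java.util.ArrayList;
import java.util.List;

final class RectangularCheck {
  /**
   * Returns the index of the first row whose length differs from row 0,
   * or -1 if all rows have the same length (hypothetical helper).
   */
  static int firstRaggedRow(double[][] data) {
    if (data.length == 0) {
      return -1;
    }
    int dim = data[0].length;
    for (int i = 1; i < data.length; i++) {
      if (data[i].length != dim) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    List<double[]> rows = new ArrayList<>();
    rows.add(new double[] {1.0, 0.0, 2.0});
    rows.add(new double[] {3.0, 4.0}); // shorter row, as when only some values are stored
    // toArray copies the row references; it does not pad rows to a common length.
    double[][] matrix = rows.toArray(new double[0][]);
    int bad = firstRaggedRow(matrix);
    System.out.println(bad < 0 ? "rectangular" : "ragged at row " + bad);
  }
}
```

If the check reports a ragged row, the data is not a valid dense matrix, and a mismatch of this kind is a plausible source of the ArrayIndexOutOfBoundsException.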

You can use FileBasedDatabaseConnection and the ArffParser to read sparse data. Alternatively, you can implement your own DatabaseConnection; it has only a single method, loadData().

DoubleVector is a dense data type; SparseDoubleVector is a sparse vector type. Internally, DoubleVector is backed by a dense double[] array, whereas SparseDoubleVector uses an int[] with the nonzero dimensions plus a double[] with only the nonzero values.
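The difference between the two layouts can be illustrated in plain Java. This is only a sketch of the storage idea described above, not ELKI's actual implementation; the class `SparseSketch` and its members are hypothetical.

```java
import java.util.Arrays;

final class SparseSketch {
  final int[] indexes;   // dimensions holding a nonzero value, in ascending order
  final double[] values; // the values at those dimensions
  final int dim;         // dimensionality of the original dense vector

  /** Builds the sparse layout from a dense array by keeping only nonzero entries. */
  SparseSketch(double[] dense) {
    int nonzero = 0;
    for (double v : dense) {
      if (v != 0.0) {
        nonzero++;
      }
    }
    indexes = new int[nonzero];
    values = new double[nonzero];
    dim = dense.length;
    int j = 0;
    for (int i = 0; i < dense.length; i++) {
      if (dense[i] != 0.0) {
        indexes[j] = i;
        values[j] = dense[i];
        j++;
      }
    }
  }

  /** Looks up a dimension; dimensions that are not stored are implicitly zero. */
  double get(int d) {
    int pos = Arrays.binarySearch(indexes, d);
    return pos >= 0 ? values[pos] : 0.0;
  }
}
```

For the dense vector `{0, 5, 0, 0, 7}`, this stores the indexes `{1, 4}` and the values `{5, 7}`, which is why a plain `double[]` row cannot stand in for a sparse vector.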

K-means requires a fixed dimensionality to allocate the mean vectors (these will always be dense), so make sure to supply a VectorFieldTypeInformation with the maximum dimensionality. There is a type conversion filter that simply scans your data set once and sets the dimensionality accordingly.
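The two steps mentioned here, scanning once for the maximum dimensionality and then accumulating a dense mean, might look roughly like this in plain Java. It is a sketch under the assumption that each sparse row is given as an index array plus a value array with 0-based indexes; `SparseMean` and its methods are hypothetical and do not reproduce ELKI's filter or its k-means code.

```java
final class SparseMean {
  /** Scans all index arrays once and returns the largest index plus one. */
  static int maxDimensionality(int[][] indexes) {
    int dim = 0;
    for (int[] row : indexes) {
      for (int i : row) {
        dim = Math.max(dim, i + 1);
      }
    }
    return dim;
  }

  /** Averages sparse rows into a dense mean vector of the given dimensionality. */
  static double[] mean(int[][] indexes, double[][] values, int dim) {
    double[] mean = new double[dim]; // dense, even though the inputs are sparse
    for (int r = 0; r < indexes.length; r++) {
      for (int j = 0; j < indexes[r].length; j++) {
        mean[indexes[r][j]] += values[r][j];
      }
    }
    for (int d = 0; d < dim; d++) {
      mean[d] /= indexes.length;
    }
    return mean;
  }
}
```

Without the first scan there is no fixed length to allocate for `mean`, which is the practical reason the maximum dimensionality has to be known up front.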

Erich Schubert