How to overcome SVMWithSGD throwing ArrayIndexOutOfBoundsException for indices bigger than 5000?

Question

To detect visitors' demographics based on their behavior, I used the SVM algorithm from Spark MLlib:

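    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.classification.SVMModel;
    import org.apache.spark.mllib.classification.SVMWithSGD;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.util.MLUtils;
    import scala.Tuple2;

    // Load the LIBSVM-formatted file; sc is a JavaSparkContext.
    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), "labels.txt").toJavaRDD();

    // Sample 60% of the data for training; the remainder is used for testing.
    JavaRDD<LabeledPoint> training = data.sample(false, 0.6, 11L);
    training.cache();
    JavaRDD<LabeledPoint> test = data.subtract(training);

    // Run training algorithm to build the model.
    int numIterations = 100;
    final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);

    // Clear the default threshold.
    model.clearThreshold();

    // Compute raw scores on the test set (SVMTestMapper is not shown here).
    JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(new SVMTestMapper(model));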

Unfortunately, the call SVMWithSGD.train(training.rdd(), numIterations) throws an ArrayIndexOutOfBoundsException:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 4857

labels.txt is a text file in which each line has the form:

Visitor criterion (is male) | List[siteId:access count]

1 27349:1 23478:1 35752:1 9704:2 27896:1 30050:2 30018:1

1 36214:1 26378:1 26606:1 26850:1 17968:2

1 21870:1 41294:1 37388:1 38626:1 10711:1 28392:1 20749:1

1 29328:1 34370:1 19727:1 29542:1 37621:1 20588:1 42426:1 30050:6 28666:1 23190:3 7882:1 35387:1 6637:1 32131:1 23453:1

I tried a lot of data and several algorithms, and as far as I can see the error occurs for site IDs bigger than 5000.

Is there a way to overcome this, or is there another library for this problem? Or, because the data matrix is too sparse, should I use SVD?

That exception indicates a bug, either in your code or in the code you are calling. Without further information it is impossible to tell which. - Raedwald, Jun 16 '16 at 08:45
@Raedwald The point is that for siteIds smaller than 1000, SVMWithSGD.train works and doesn't throw this exception. - Iura Gaitur, Jun 16 '16 at 09:29
You simply have a bug in your code; there is nothing special about 5000. My guess would be: you independently load the train and test datasets, both in sparse format, and the biggest feature ID in the train set is smaller than in the test set, so the train and test matrices have different dimensions. - lejlot, Jun 16 '16 at 21:15
@lejlot The point is that I get this error when I try to create the SVM model from the training data. As for the datasets, both come from the same text file. So the question arises: should I transform the dataset using SVD because it is too sparse? - Iura Gaitur, Jun 17 '16 at 08:56
Please include your whole code and data; otherwise it is really hard to help you. - lejlot, Jun 17 '16 at 09:03
@lejlot Thanks for the help. I changed the question. As you can see, I used the tutorial from the Spark website. Could I be getting this error because of the sparseness of the data? And should I first apply SVD? - Iura Gaitur, Jun 17 '16 at 10:12
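
One thing that may be worth checking before reaching for SVD: the LIBSVM text format read by MLUtils.loadLibSVMFile expects one-based feature indices in ascending order on each line, while the sample lines in the question list their siteIds in arbitrary order. Whether this is the actual cause of the exception is not established in the thread. The sketch below only shows, in plain Java without Spark, how a line could be checked and rewritten. The class LibSvmLines and its methods are hypothetical helpers introduced for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Spark: checks and repairs the ordering
// of index:value pairs in one LIBSVM-style line such as "1 9704:2 27349:1".
final class LibSvmLines {

    // Parses the feature index from a token like "27349:1".
    static int featureIndex(String token) {
        return Integer.parseInt(token.substring(0, token.indexOf(':')));
    }

    // True if all indices are positive and strictly ascending.
    static boolean isAscending(String line) {
        String[] tokens = line.trim().split("\\s+");
        int previous = 0; // one-based indices must be greater than zero
        for (int i = 1; i < tokens.length; i++) {
            int index = featureIndex(tokens[i]);
            if (index <= previous) {
                return false;
            }
            previous = index;
        }
        return true;
    }

    // Returns the same line with its pairs sorted by ascending index.
    // The label (first token) stays in place; duplicate indices are not merged.
    static String sortFeatures(String line) {
        String[] tokens = line.trim().split("\\s+");
        List<String> features =
                new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
        features.sort(Comparator.comparingInt(LibSvmLines::featureIndex));
        StringBuilder out = new StringBuilder(tokens[0]);
        for (String feature : features) {
            out.append(' ').append(feature);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Second sample line from the question.
        String line = "1 36214:1 26378:1 26606:1 26850:1 17968:2";
        System.out.println(isAscending(line));
        System.out.println(sortFeatures(line));
    }
}
```

Applying sortFeatures to every line of labels.txt and loading the rewritten file would help rule this cause in or out.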

0 Answers