To detect visitors' demographics based on their behavior, I used the SVM algorithm from Spark MLlib:
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), "labels.txt").toJavaRDD();
JavaRDD<LabeledPoint> training = data.sample(false, 0.6, 11L);
training.cache();
JavaRDD<LabeledPoint> test = data.subtract(training);
// Run training algorithm to build the model.
int numIterations = 100;
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
// Clear the default threshold.
model.clearThreshold();
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(new SVMTestMapper(model));
Unfortunately, the line
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
throws an ArrayIndexOutOfBoundsException:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4857
labels.txt is a text file in which each line has the form:
visitor criterion (is male) | list of siteId:accessCount pairs
1 27349:1 23478:1 35752:1 9704:2 27896:1 30050:2 30018:1
1 36214:1 26378:1 26606:1 26850:1 17968:2
1 21870:1 41294:1 37388:1 38626:1 10711:1 28392:1 20749:1
1 29328:1 34370:1 19727:1 29542:1 37621:1 20588:1 42426:1 30050:6 28666:1 23190:3 7882:1 35387:1 6637:1 32131:1 23453:1
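One thing that may be worth checking: the documentation for MLUtils.loadLibSVMFile describes the LIBSVM format as having one-based indices in ascending order, and the site IDs in the lines above are not sorted. I am not sure this is what triggers the exception, but here is a minimal sketch for rewriting each line with its indices sorted before loading. The class name LibSvmLineSorter and the method sortIndices are hypothetical helpers made up for this illustration, not part of Spark.

```java
import java.util.Map;
import java.util.TreeMap;

public class LibSvmLineSorter {

    // Hypothetical helper (not a Spark API): rewrites one line of the form
    // "label index:value index:value ..." so that the indices ascend.
    // If an index appears twice on a line, the later value wins.
    public static String sortIndices(String line) {
        String[] tokens = line.trim().split("\\s+");

        // TreeMap iterates its keys in ascending order.
        Map<Integer, String> features = new TreeMap<>();
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":", 2);
            features.put(Integer.parseInt(pair[0]), pair[1]);
        }

        // tokens[0] is the label; it is kept unchanged at the front.
        StringBuilder sorted = new StringBuilder(tokens[0]);
        for (Map.Entry<Integer, String> e : features.entrySet()) {
            sorted.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sorted.toString();
    }

    public static void main(String[] args) {
        // Second sample line from labels.txt.
        System.out.println(sortIndices("1 36214:1 26378:1 26606:1 26850:1 17968:2"));
        // prints: 1 17968:2 26378:1 26606:1 26850:1 36214:1
    }
}
```

Running every line of labels.txt through such a helper and loading the rewritten file would at least rule out the index ordering as the cause.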
I tried many data sets and algorithms, and from what I have seen the error occurs for site IDs larger than 5000.
Is there a way to overcome this, or is there another library suited to this problem? Or, because the data matrix is too sparse, should I use SVD?