Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1 vote, 0 answers

How to overcome SVMWithSGD throwing ArrayIndexOutOfBoundsException for an index bigger than 5000?

In order to detect visitor demographics based on their behavior I used the SVM algorithm from Spark MLlib: JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), "labels.txt").toJavaRDD(); JavaRDD<LabeledPoint> training = data.sample(false,…
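A common cause of this error is a mismatch between the feature dimension that MLUtils.loadLibSVMFile infers from one file and a larger feature index appearing in another vector (for example, a separately loaded test set). Pinning the number of features explicitly is the usual first thing to try. A minimal Scala sketch, assuming sc is a SparkContext and 10000 is only a placeholder for the true maximum feature index:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    // Pin the vector dimension instead of letting it be inferred per file
    val numFeatures = 10000  // placeholder: must cover the largest feature index in the data
    val data = MLUtils.loadLibSVMFile(sc, "labels.txt", numFeatures)

    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 11L)
    val model = SVMWithSGD.train(training.cache(), 100)  // 100 = number of SGD iterations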
1 vote, 0 answers

Spark error on ALS trainImplicit: assertion failed: lapack.dppsv returned 1

I am getting the error below when training ALS (implicit) using Hadoop 2.6.1 and Spark 1.5.2 on Ubuntu 14: 16/06/16 06:26:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 16/06/16 06:26:41 WARN BLAS: Failed to…
Smrutiranjan Sahu
  • 6,911
  • 2
  • 15
  • 12
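The lapack.dppsv assertion usually means the least-squares solve inside ALS hit a matrix that is not positive definite; it is often reported to go away with a larger regularization parameter (and after removing NaN or duplicate ratings). A hedged Scala sketch of the knobs involved, assuming ratings is an existing RDD[Rating]:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings: RDD[Rating] is assumed to exist and to be free of NaNs and duplicates
    val rank = 10
    val iterations = 10
    val lambda = 0.1   // regularization: raising this often avoids the dppsv failure
    val alpha = 1.0    // confidence scaling for implicit feedback

    val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)

The BLAS "Failed to load implementation" warnings only mean the pure-JVM netlib fallback is being used; they are not themselves the cause of the assertion.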
1 vote, 1 answer

How to randomly shuffle rows of an RDD in Spark?

I have an RDD[String] and I want to shuffle all of the rows of this RDD. How do I achieve this? For example, for an RDD object named rdd you can run rdd.collect.foreach(t => println(t)), which has output: 1 2 3 4 I want to shuffle the rows of rdd so that…
user3494047
  • 1,643
  • 4
  • 31
  • 61
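One common way to do this (not the only one) is to attach a random sort key to every row and sort by it. A minimal Scala sketch, assuming rdd is the RDD[String] from the question:

    import scala.util.Random

    val shuffled = rdd
      .mapPartitionsWithIndex { (i, iter) =>
        val rng = new Random(i)                   // per-partition RNG, seeded for reproducibility
        iter.map(row => (rng.nextDouble(), row))  // attach a random sort key to each row
      }
      .sortByKey()
      .values

    shuffled.collect().foreach(println)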
1 vote, 0 answers

Spark MLLib ALS: Efficient mapping of misc user and product IDs to integer

I am attempting to build an online recommender system using the Spark recommendation ALS algorithm. My data resides in MongoDB, where I keep collections of users, items and ratings. The identifiers for these documents are of the default type…
Fulco
  • 284
  • 1
  • 3
  • 16
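MLlib's ALS only accepts Int user and product IDs, so MongoDB ObjectId-style identifiers have to be translated first. One common pattern is to build lookup RDDs with zipWithUniqueId and join them onto the ratings; a sketch, where rawRatings is a hypothetical RDD[(String, String, Double)] of (userId, itemId, rating):

    import org.apache.spark.mllib.recommendation.Rating

    // Build (stringId -> intId) lookup tables; assumes fewer than Int.MaxValue distinct IDs
    val userIdToInt = rawRatings.map(_._1).distinct().zipWithUniqueId().mapValues(_.toInt)
    val itemIdToInt = rawRatings.map(_._2).distinct().zipWithUniqueId().mapValues(_.toInt)

    val ratings = rawRatings
      .map { case (u, i, r) => (u, (i, r)) }
      .join(userIdToInt)                                   // attach the integer user ID
      .map { case (_, ((i, r), uInt)) => (i, (uInt, r)) }
      .join(itemIdToInt)                                   // attach the integer item ID
      .map { case (_, ((uInt, r), iInt)) => Rating(uInt, iInt, r) }

The same lookup tables need to be kept (or persisted) so recommendations can be translated back to the original MongoDB identifiers.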
1 vote, 1 answer

Spark MLlib recommender engine's methods

I'm using PySpark MLlib and the out-of-the-box ALS method for collaborative filtering. Just wondering, does Spark provide some other methods of doing filtering (for calculating distance), for example Pearson's or cosine? Can they be done in Spark…
Keithx
  • 2,994
  • 15
  • 42
  • 71
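The recommendation package itself only ships ALS, but a cosine item-item similarity can be computed with the linear algebra utilities, for example RowMatrix.columnSimilarities. A minimal sketch, assuming rows is an RDD of mllib Vectors with one row per user and one column per item:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)

    // Exact cosine similarity between every pair of columns (items)
    val similarities = mat.columnSimilarities()

    // Approximate (DIMSUM-sampled) version for large, sparse matrices
    val approxSimilarities = mat.columnSimilarities(0.1)

For Pearson correlation, org.apache.spark.mllib.stat.Statistics.corr can be applied to the same RDD of vectors.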
1 vote, 1 answer

How to choose the combining strategy for MLlib's random forests

Is it possible to choose the combining strategy for MLlib's random forests? I can't find any clue on the official API docs. Here's my code: val numClasses = 10 val categoricalFeaturesInfo = Map[Int, Int]() val numTrees = 10 val…
Franjrg
  • 100
  • 1
  • 11
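For reference, the old mllib RandomForest API exposes the tree and sampling parameters shown below but, as far as the public API goes, not the combining rule itself: classification ensembles use majority vote and regression ensembles average the trees. A sketch of the knobs that are configurable, with placeholder values matching the question:

    import org.apache.spark.mllib.tree.RandomForest

    // trainingData: RDD[LabeledPoint] is assumed to exist
    val numClasses = 10
    val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: all features are continuous
    val numTrees = 10
    val featureSubsetStrategy = "auto"             // let MLlib choose based on numTrees
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)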
1 vote, 0 answers

PredictionIO pio train fails with exception

I am setting up PredictionIO on my Unix machine. I was able to set up everything required and am now using the Lead Scoring template. I can successfully build the template using the pio build --verbose command; it says the engine is ready to train.…
gaurav
  • 317
  • 1
  • 3
  • 10
1 vote, 0 answers

Logistic regression scoring: java.lang.NumberFormatException

I am using Spark 1.5 and I would like to use a logistic regression model that I saved from my training phase for scoring a new dataset. Here is my sample data in libsvm file format: 1132106-2011-05-10 52:1 64:1 207:1 232:1 353:1 597:1 The first…
user3803714
  • 5,269
  • 10
  • 42
  • 61
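The libsvm format requires a numeric label, so a line that starts with an identifier such as 1132106-2011-05-10 makes loadLibSVMFile fail with NumberFormatException. A hedged sketch of parsing such lines by hand instead, keeping the identifier next to the sparse feature vector (the file name and feature count are assumptions):

    import org.apache.spark.mllib.linalg.Vectors

    val numFeatures = 1000  // placeholder: the feature dimension used during training

    // Each line looks like: "<recordId> <index>:<value> <index>:<value> ..."
    val scoring = sc.textFile("scoring.txt").map { line =>
      val tokens = line.split(" ")
      val recordId = tokens.head                 // keep the ID to join results back later
      val (indices, values) = tokens.tail.map { t =>
        val Array(i, v) = t.split(":")
        (i.toInt - 1, v.toDouble)                // libsvm feature indices are 1-based
      }.unzip
      (recordId, Vectors.sparse(numFeatures, indices, values))
    }

    // model: the saved LogisticRegressionModel, assumed to be loaded already
    val scores = scoring.mapValues(v => model.predict(v))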
1 vote, 1 answer

Spark Streaming Model Overwrite

This is a straightforward question: how can I save my updated model with the same name to the same directory? org.apache.spark.sql.AnalysisException: path file:/home/mali/model/UpdatedmyRandomForestClassificationModel/data already exists There is SaveMode…
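Neither the mllib model savers nor the ml writers take a DataFrame-style SaveMode, but there are two common ways around the "path already exists" error; which one applies depends on whether this is a spark.ml model (MLWritable, Spark 2.x) or an old mllib model. A sketch of both, with the path taken from the question:

    // Option 1: spark.ml models (MLWritable) have an explicit overwrite switch
    model.write.overwrite().save("/home/mali/model/UpdatedmyRandomForestClassificationModel")

    // Option 2: for old mllib models (model.save(sc, path)), delete the directory first
    import org.apache.hadoop.fs.{FileSystem, Path}
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.delete(new Path("/home/mali/model/UpdatedmyRandomForestClassificationModel"), true)
    model.save(sc, "/home/mali/model/UpdatedmyRandomForestClassificationModel")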
1 vote, 1 answer

What about files smaller than the Hadoop block size: Spark + machine learning

My Hadoop block size is 128 MB and my file is 30 MB. The cluster on which Spark is running is a 4-node cluster with a total of 64 cores. Now my task is to run a random forest or gradient boosting algorithm with a parameter grid and 3-fold cross…
Abhishek
  • 3,337
  • 4
  • 32
  • 51
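Since the 30 MB file fits inside a single 128 MB block, it will typically arrive as one (or very few) partitions, leaving most of the 64 cores idle during the grid search. A small sketch of the usual mitigation, repartitioning and caching right after load; the target of 64 partitions is just a rule of thumb of roughly one per core:

    // data: the RDD or DataFrame loaded from the 30 MB file (assumed)
    val spread = data.repartition(64)  // a single small file usually loads as 1 partition
    spread.cache()                     // the parameter grid / 3-fold CV will reuse it repeatedly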
1 vote, 0 answers

Are the Spark ML libraries suitable for classifying instances one by one?

The Spark ML library proudly presents its capability for model selection. I thought it fit my use case: in the big data world, train on many, many labeled data points, do clever model selection by tuning parameters etc., and save the best model to…
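A saved spark.ml model can score single records, but each call has to go through a one-row DataFrame, which carries real per-record overhead; for low-latency, one-at-a-time prediction this is worth benchmarking first. A rough Spark 2.x-style sketch under those assumptions (the path, schema, and column names are made up, and the saved pipeline is assumed to end in a classifier that emits a prediction column):

    import org.apache.spark.ml.PipelineModel

    val model = PipelineModel.load("/models/best")   // hypothetical path of the saved best model

    // Wrap the single incoming instance in a one-row DataFrame
    val single = spark.createDataFrame(Seq(
      (0L, "raw feature text of the new instance")   // hypothetical id + feature column
    )).toDF("id", "text")

    val prediction = model.transform(single).select("prediction").head().getDouble(0)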
1 vote, 0 answers

spark-mllib gbdt algorithm questions

Has anyone read the MLlib GBDT code? I have some questions about this algorithm: I don't know how the program calculates the current node impurity. I only see the overridden calculate function in the subclasses of Impurity; in this function, the parameter is…
lee li
  • 11
  • 1
1 vote, 1 answer

Why are the StreamingKMeans cluster centers different from regular KMeans?

I have two models trained using the same data; the KMeans model is set up like below: int numIterations = 20; int numClusters = 5; int runs = 10; double epsilon = 1.0e-6; KMeans kmeans = new KMeans(); kmeans.setEpsilon(epsilon); …
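Part of the difference is expected: with the settings above, batch KMeans initializes with k-means|| and iterates over the full dataset, while StreamingKMeans typically starts from random centers and updates them incrementally per mini-batch, so the two need not land on identical centers. For comparison, a Scala sketch of a StreamingKMeans setup (the dimension, initial weight, and seed are placeholders):

    import org.apache.spark.mllib.clustering.StreamingKMeans

    val streamingKMeans = new StreamingKMeans()
      .setK(5)
      .setDecayFactor(1.0)              // 1.0 = all past data keeps full weight
      .setRandomCenters(3, 0.0, 42L)    // dim, initial weight, seed: random starting centers

    // streamingKMeans.trainOn(trainingStream)  // trainingStream: DStream[Vector], assumed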
1 vote, 1 answer

MLlib LogisticRegressionWithLBFGS error when using model.predict

I'm using MLlib's LogisticRegressionWithLBFGS to train a model with 4 classes. This is the code for preparing my data, val labeledTraining = trainingSetVectors.map{case(target,features) => LabeledPoint(target,features) }.cache() val…
other15
  • 839
  • 2
  • 11
  • 23
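With four classes, a frequent cause of predict-time errors is leaving LogisticRegressionWithLBFGS at its binary default; the class count has to be set explicitly before training. A minimal sketch built on the labeledTraining RDD from the question:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    // labeledTraining: RDD[LabeledPoint] with labels 0.0, 1.0, 2.0, 3.0
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(4)   // multinomial: must match the number of distinct label values
      .run(labeledTraining)

    val predictionsAndLabels = labeledTraining.map(p => (model.predict(p.features), p.label))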
1 vote, 3 answers

How to skip a line in a Spark RDD map action based on an if condition

I have a file and I want to give it to an mllib algorithm. So I am following the example and doing something like: val data = sc.textFile(my_file). map {line => val parts = line.split(","); Vectors.dense(parts.slice(1,…
user3494047
  • 1,643
  • 4
  • 31
  • 61
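Rather than returning a dummy vector from map, a line can be dropped altogether by switching to flatMap and emitting an Option (or by calling filter first). A sketch built on the parsing from the question; the skip condition itself is a placeholder:

    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile(my_file).flatMap { line =>
      val parts = line.split(",")
      if (parts.length < 2 || parts.drop(1).exists(_.trim.isEmpty))  // placeholder condition
        None                                            // flatMap simply drops this line
      else
        Some(Vectors.dense(parts.slice(1, parts.length).map(_.toDouble)))
    }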