Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1
vote
2 answers

What is the reason for compilation errors if different versions of Spark-core and Spark-mllib are mixed?

I am copying and pasting the exact Spark MLlib LDA example from here: http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda I am trying the Scala sample code, but I am having the following errors when I am trying…
Rami
  • 8,044
  • 18
  • 66
  • 108
1
vote
1 answer

Spark python MLlib Random Forest out of memory error

I am running Spark 1.2.1 to train a random forest. I have a master and a worker node set up on AWS EC2 with a total of 96GB of memory allocated to Spark. I played with various parallelism values (32, 64, 6400) and I keep getting the same error. According…
foboi1122
  • 1,727
  • 4
  • 19
  • 36
1
vote
1 answer

spark pyspark mllib model - when prediction rdd is generated using map, it throws exception on collect()

I am using Spark 1.2.0 (I cannot upgrade as I don't have control over it). I am using MLlib to build a model: points = labels.zip(tfidf).map(lambda t: LabeledPoint(t[0], t[1] )) train_data, test_data = points.randomSplit([0.6, 0.4], 17) iterations =…
Abhishek
  • 33
  • 6
1
vote
1 answer

Broadcast Random-Forest Model in PySpark

I'm using Spark 1.4.1. When I try to broadcast a random forest model, it shows me this error: Traceback (most recent call last): File "/gpfs/haifa/home/d/a/davidbi/codeBook/Nice.py", line 358, in broadModel = sc.broadcast(model) File…
1
vote
1 answer

How to serialize apache spark's MatrixFactorizationModel in Java

I am building a recommendation system using Apache Spark MLlib and Java. Once the MatrixFactorizationModel is built, I serialize it as a Java object, and when retrieving the model I get the following exception. Caused by:…
1
vote
1 answer

Spark 1.4 Mllib LDA topicDistributions() returning wrong number of documents

I have an LDA model running on a corpus of 12,054 documents with a vocab size of 9,681 words and 60 clusters. I am trying to get the topic distribution over documents by calling .topicDistributions() or .javaTopicDistributions(). Both of these…
smannan
  • 136
  • 1
  • 1
  • 4
1
vote
1 answer

How to save a Spark LogisticRegressionModel model?

I am using MLlib 1.1.0 and struggling to find a way to save my model. The docs do not seem to support such a feature in this version. Any ideas?
user706838
  • 5,132
  • 14
  • 54
  • 78
1
vote
1 answer

mllib and pyspark bag of words model for multiple text documents

I have 150 text documents (training set) that I would like to perform a "bag of words" representation on with pyspark and mllib package "feature". From here I then have another 150 text documents (testing set) that I would like to also convert each…
Matt
  • 1,196
  • 1
  • 9
  • 22
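
The two-step shape this question describes (build a vocabulary from the training documents, then reuse it for the testing documents) can be sketched without a cluster. This is a minimal, plain-Python illustration and not MLlib's own feature code: the function names `build_vocabulary` and `to_bag_of_words` and the sample documents are hypothetical, tokenization is assumed to be lowercase whitespace splitting, and words that appear only in the test set are simply ignored.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token in the training documents to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_bag_of_words(document, vocabulary):
    """Return a dense count vector; tokens not in the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] = n
    return vector


# Hypothetical sample data standing in for the training and testing sets.
train_docs = ["spark makes big data simple", "big data needs big clusters"]
test_docs = ["spark clusters scale", "simple data"]

# The vocabulary comes from the training set only, so both sets share columns.
vocab = build_vocabulary(train_docs)
train_vectors = [to_bag_of_words(d, vocab) for d in train_docs]
test_vectors = [to_bag_of_words(d, vocab) for d in test_docs]
```

Fixing the vocabulary on the training set keeps the vector length identical for both sets, which is what a downstream classifier would need.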
1
vote
3 answers

How to extract data from Spark MLlib FP Growth model

I am running a Spark master and slaves in standalone mode, with no Hadoop cluster. Using spark-shell, I can quickly build an FPGrowthModel with my data. Once the model is built, I am trying to look at the patterns and frequencies captured within the model,…
emily
  • 198
  • 2
  • 10
1
vote
0 answers

Java heap space Error while running SVMWithSGD algorithm in MLlib

My fnl2 dataset is of the form: scala> fnl2.first() res4: org.apache.spark.mllib.regression.LabeledPoint =…
user706838
  • 5,132
  • 14
  • 54
  • 78
1
vote
1 answer

How to convert an RDD to Vector in Spark

I have an RDD of type RDD[(Int,Double)] in which the first element of the pair is the index and the second is the value and I'd like to convert this RDD to a Vector to use for classification. Could someone help me with that? I have the following…
HHH
  • 6,085
  • 20
  • 92
  • 164
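
One way to picture the conversion asked about here: a sparse vector is described by a size, a sorted list of indices, and a matching list of values. The sketch below works on a plain Python list of (index, value) pairs, as one might obtain by collecting a small RDD to the driver, which assumes the data fits in memory. The helper names `to_sparse_components` and `to_dense` are hypothetical and are not part of MLlib; in a Spark session the resulting three components would typically be passed on to `Vectors.sparse`.

```python
def to_sparse_components(pairs, size=None):
    """Turn (index, value) pairs into (size, sorted indices, values)."""
    ordered = sorted(pairs)
    indices = [i for i, _ in ordered]
    if len(set(indices)) != len(indices):
        raise ValueError("each index may appear only once")
    values = [float(v) for _, v in ordered]
    if size is None:
        # Assumption: without an explicit size, use the largest index + 1.
        size = indices[-1] + 1 if indices else 0
    return size, indices, values


def to_dense(size, indices, values):
    """Expand sparse components into a dense list, filling gaps with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical pairs, in arbitrary order, as they might arrive from collect().
pairs = [(3, 2.5), (0, 1.0), (5, 4.0)]
size, indices, values = to_sparse_components(pairs)
dense = to_dense(size, indices, values)
```

Sorting before splitting matters because sparse vector constructors generally expect indices in increasing order.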
1
vote
1 answer

How to convert Mahout VectorWritable to Vector in Spark

I have a VectorWritable (org.apache.mahout.math.VectorWritable) which comes from a sequence file generated by Mahout, and I would like to convert it into the Vector (org.apache.spark.mllib.linalg.Vectors) type in Spark. How can I do that in Scala?
HHH
  • 6,085
  • 20
  • 92
  • 164
1
vote
1 answer

"main" java.lang.ClassCastException: [Lscala.Tuple2; cannot be cast to scala.Tuple2 in Spark MLlib LDA

I'm using Spark 1.3.0 (Scala 2.10.X) MLlib LDA algorithm with Spark Java API. I have the following issue when I try to read the document-topic distribution from LDA model during runtime. "main" java.lang.ClassCastException: [Lscala.Tuple2; cannot…
Jay
  • 63
  • 8
1
vote
1 answer

Issue with Zeppelin on Spark-Cassandra system: Classnotfoundexception

I have recently started to work with Zeppelin on top of a Spark-Cassandra cluster (Master + 3 Workers) to run simple machine learning algorithms using the MLlib library. Here are the libraries that I loaded to…
1
vote
1 answer

Distributed BlockMatrix out of Spark Matrices

How to make a distributed BlockMatrix out of Matrices (of the same size)? For example, let A, B be two 2 by 2 mllib.linalg.Matrices as follows import org.apache.spark.mllib.linalg.{Matrix, Matrices} import…
Ehsan M. Kermani
  • 912
  • 2
  • 12
  • 26