Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
15
votes
1 answer

How are number of iterations and number of partitions related in Apache Spark Word2Vec?

According to mllib.feature.Word2Vec - spark 1.3.1 documentation [1]: def setNumIterations(numIterations: Int): Word2Vec.this.type Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions. def…
Arshiyan Alam
15
votes
1 answer

Spark LDA consumes too much memory

I'm trying to use Spark MLlib LDA to summarize my document corpus. My problem setting is as below: about 100,000 documents, about 400,000 unique words, 100 clusters. I have 16 servers (each has 20 cores and 128GB memory). When I execute LDA with…
Du Shiqiao
15
votes
1 answer

How to save and load MLLib model in Apache Spark?

I trained a classification model in Apache Spark (using pyspark). I stored the model in an object, LogisticRegressionModel. Now, I want to make predictions on new data. I would like to store the model, and read it back into a new program in order to…
berto77
15
votes
3 answers

How to save models from ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows: import java.io._ def saveModel(name: String, model: PipelineModel) = { val oos = new ObjectOutputStream(new…
SH Y.
15
votes
2 answers

From DataFrame to RDD[LabeledPoint]

I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following: import org.apache.spark.sql.{Row, SQLContext} import org.apache.spark.sql.types.{StringType,…
Miguel
15
votes
2 answers

How to update Spark MatrixFactorizationModel for ALS

I built a simple recommendation system for the MovieLens DB inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html. I also have problems with explicit training like here: Apache Spark ALS collaborative…
14
votes
1 answer

Matrix Math With Sparklyr

Looking to convert some R code to Sparklyr, functions such as lmtest::coeftest() and sandwich::sandwich(). Trying to get started with Sparklyr extensions but pretty new to the Spark API and having issues :( Running Spark 2.1.1 and sparklyr…
Zafar
14
votes
1 answer

How to convert spark DataFrame to RDD mllib LabeledPoints?

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame, but I need MLlib LabeledPoints to feed my RandomForest. How can I do that? My code: import…
Tianyi Wang
14
votes
1 answer

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.
Rami
14
votes
1 answer

Create labeledPoints from Spark DataFrame in Python

What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column but I can refer to its column name, 'status'? I create the Python DataFrame with this…
14
votes
5 answers

How to integrate Apache Spark with Spring MVC web application for interactive user sessions

I am trying to build a movie recommender system using Apache Spark MLlib. I have written the recommender code in Java, and it works fine when run using the spark-submit command. My run command looks like this: bin/spark-submit --jars…
14
votes
1 answer

Spark MLlib - trainImplicit warning

I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. I tried to call repartition…
Tarantula
14
votes
2 answers

Addition of two RDD[mllib.linalg.Vector]'s

I need to add two matrices that are stored in two files. latest1.txt and latest2.txt each contain the following: 1 2 3 4 5 6 7 8 9. I am reading those files as follows: scala> val rows = sc.textFile("latest1.txt").map { line => val values…
krishna
13
votes
1 answer

Spark LinearRegressionSummary "normal" summary

According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver. This value is only available when using the "normal" solver. What the hell is the "normal" solver? I'm doing this: import…
Paul Reiners
13
votes
2 answers

How to extract a value from a Vector in a column of a Spark Dataframe

When using SparkML to predict labels, the result DataFrame is:

scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]| …