Highest Voted 'apache-spark-mllib' Questions

15

votes

1 answer

How are the number of iterations and the number of partitions related in Apache Spark Word2Vec?

According to mllib.feature.Word2Vec - spark 1.3.1 documentation [1]: def setNumIterations(numIterations: Int): Word2Vec.this.type Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions. def…

apache-spark apache-spark-mllib word2vec

asked Jun 02 '16 at 04:53

Arshiyan Alam

335
1
11

15

votes

1 answer

Spark LDA consumes too much memory

I'm trying to use Spark MLlib LDA to summarize my document corpus. My problem setting is as follows: about 100,000 documents, about 400,000 unique words, 100 clusters. I have 16 servers (each has 20 cores and 128GB memory). When I execute LDA with…

apache-spark apache-spark-mllib lda

asked Mar 14 '16 at 03:59

Du Shiqiao

377
1
9

15

votes

1 answer

How to save and load MLlib model in Apache Spark?

I trained a classification model in Apache Spark (using pyspark). I stored the model in an object, LogisticRegressionModel. Now, I want to make predictions on new data. I would like to store the model, and read it back into a new program in order to…

python apache-spark pyspark apache-spark-mllib

asked Dec 14 '15 at 15:13

berto77

885
3
12
29

15

votes

3 answers

How to save models from ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows: import java.io._ def saveModel(name: String, model: PipelineModel) = { val oos = new ObjectOutputStream(new…

java scala apache-spark apache-spark-mllib apache-spark-ml

asked Aug 30 '15 at 01:09

SH Y.

1,709
3
20
21

15

votes

2 answers

From DataFrame to RDD[LabeledPoint]

I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following: import org.apache.spark.sql.{Row, SQLContext} import org.apache.spark.sql.types.{StringType,…

scala apache-spark apache-spark-mllib

asked Jun 18 '15 at 21:06

Miguel

1,201
2
13
30

15

votes

2 answers

How to update Spark MatrixFactorizationModel for ALS

I built a simple recommendation system for the MovieLens DB inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html. I also have problems with explicit training like here: Apache Spark ALS collaborative…

apache-spark machine-learning apache-spark-mllib collaborative-filtering

asked May 28 '15 at 14:22

mniehoff

507
1
5
15

14

votes

1 answer

Matrix Math With Sparklyr

Looking to convert some R code to Sparklyr, functions such as lmtest::coeftest() and sandwich::sandwich(). Trying to get started with Sparklyr extensions but pretty new to the Spark API and having issues :( Running Spark 2.1.1 and sparklyr…

r apache-spark apache-spark-mllib sparklyr

asked Jun 17 '17 at 06:52

Zafar

1,897
15
33

14

votes

1 answer

How to convert spark DataFrame to RDD mllib LabeledPoints?

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame, but I need mllib LabeledPoints to feed my RandomForest. How can I do that? My code: import…

scala apache-spark rdd pca apache-spark-mllib

asked Mar 13 '16 at 05:35

Tianyi Wang

197
1
1
6

14

votes

1 answer

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.

apache-spark lda apache-spark-mllib topic-modeling

asked Sep 16 '15 at 09:22

Rami

8,044
18
66
108

14

votes

1 answer

Create labeledPoints from Spark DataFrame in Python

What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column but I can refer to its column name, 'status'? I create the Python dataframe with this…

python pandas apache-spark apache-spark-mllib apache-spark-ml

asked Sep 14 '15 at 01:29

user1518003

321
1
7
18

14

votes

5 answers

How to integrate Apache Spark with Spring MVC web application for interactive user sessions

I am trying to build a movie recommender system using Apache Spark MLlib. I have written the recommender code in Java, and it works fine when run using the spark-submit command. My run command looks like this: bin/spark-submit --jars…

java spring-mvc apache-spark machine-learning apache-spark-mllib

asked Jun 12 '15 at 05:38

hard coder

5,449
6
36
61

14

votes

1 answer

Spark MLlib - trainImplicit warning

I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. I tried to call repartition…

python apache-spark pyspark apache-spark-mllib

asked Apr 22 '15 at 17:27

Tarantula

19,031
12
54
71

14

votes

2 answers

Addition of two RDD[mllib.linalg.Vector]'s

I need to add two matrices that are stored in two files. latest1.txt and latest2.txt both contain the following: 1 2 3 4 5 6 7 8 9. I am reading those files as follows: scala> val rows = sc.textFile("latest1.txt").map { line => val values…

scala apache-spark apache-spark-mllib

asked Jan 30 '15 at 09:29

krishna

177
19
30
60

13

votes

1 answer

Spark LinearRegressionSummary "normal" summary

According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver. This value is only available when using the "normal" solver. What the hell is the "normal" solver? I'm doing this: import…

apache-spark-mllib

asked Oct 11 '17 at 19:49

Paul Reiners

8,576
33
117
202

13

votes

2 answers

How to extract a value from a Vector in a column of a Spark Dataframe

When using SparkML to predict labels the result Dataframe is: scala> result.show +-----------+--------------+ |probability|predictedLabel| +-----------+--------------+ | [0.0,1.0]| 0.0| | [0.0,1.0]| 0.0| | [0.0,1.0]| …

scala apache-spark dataframe apache-spark-sql apache-spark-mllib

asked May 02 '17 at 06:08

you zhenghong

139
1
1
4
