Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark

2241 questions
1
vote
1 answer

How to apply pyspark-mllib-kmeans to categorical variables

There is a huge data file consisting of all categorical columns. I need to dummy-code the data before applying k-means in MLlib. How can this be done in PySpark? Thank you
SparkiTony
  • 11
  • 3
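K-means needs numeric feature vectors, so categorical columns are usually one-hot (dummy) encoded first. A minimal pure-Python sketch of that transformation (the sample rows and helper name are made up; in PySpark this step is typically done with feature transformers before calling MLlib's KMeans):

```python
def one_hot_encode(rows):
    """Map each categorical column to a block of 0/1 indicator features."""
    n_cols = len(rows[0])
    # Collect the sorted distinct categories per column.
    categories = [sorted({r[c] for r in rows}) for c in range(n_cols)]
    encoded = []
    for r in rows:
        vec = []
        for c, value in enumerate(r):
            vec.extend(1.0 if value == cat else 0.0 for cat in categories[c])
        encoded.append(vec)
    return encoded, categories

rows = [("red", "small"), ("blue", "large"), ("red", "large")]
vectors, cats = one_hot_encode(rows)
# Each row becomes a purely numeric vector suitable for k-means distances.
```

The same per-column category lists must be reused when encoding new data, otherwise vector positions shift.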
1
vote
1 answer

Java heap space in spark mllib

I have the following code, which computes some metrics by cross-validation for a random forest classification. def run(data:RDD[LabeledPoint], metric:String = "PR") = { val cv_data:Array[(RDD[LabeledPoint], RDD[LabeledPoint])] =…
Pop
  • 12,135
  • 5
  • 55
  • 68
1
vote
1 answer

LDA in spark: some training documents missing from LDA model. What happened to them?

I build my corpus from a text file; the corpus is a JavaPairRDD of document IDs (created with zipWithIndex()) and counts of how many times each word in the vocabulary appears in each document. I try to count the documents below and I…
maccam912
  • 792
  • 1
  • 7
  • 22
1
vote
1 answer

Converting CoordinateMatrix to Array?

I created a CoordinateMatrix: import org.apache.spark.mllib.linalg.distributed.{ CoordinateMatrix, MatrixEntry} val entries = sc.parallelize(Seq( MatrixEntry(0, 1, 1), MatrixEntry(0, 2, 2), MatrixEntry(0, 3, 3), MatrixEntry(0, 4, 4),…
Xiaoyu Chen
  • 295
  • 1
  • 4
  • 12
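Conceptually, a CoordinateMatrix is just a collection of (row, col, value) entries, and converting it to an array means scattering those values into a zero-filled grid. A pure-Python sketch of that scatter (dimensions are inferred from the entries here; in Spark itself one would collect the entries or go through a local matrix first):

```python
def entries_to_dense(entries):
    """Scatter (row, col, value) triplets into a zero-filled dense matrix."""
    n_rows = max(i for i, _, _ in entries) + 1
    n_cols = max(j for _, j, _ in entries) + 1
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for i, j, v in entries:
        dense[i][j] = v
    return dense

# Mirrors the MatrixEntry(0, 1, 1) ... entries from the question.
entries = [(0, 1, 1.0), (0, 2, 2.0), (0, 3, 3.0), (0, 4, 4.0)]
m = entries_to_dense(entries)
```

Note that unspecified positions default to zero, which matches the sparse-matrix interpretation of the entry list.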
1
vote
0 answers

Can I extract fp-tree (any format) in spark?

FPGrowth finds frequent itemsets in a dataset in Apache Spark. But I really need the fp-tree to visualize my dataset. Is it possible to get the fp-tree that Spark constructs from my dataset?
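For context, the FP-tree that FP-Growth builds internally is a prefix tree of transactions with a count on each node. A small pure-Python sketch of that structure (the node layout is my own; Spark's public FPGrowth API does not expose its internal tree, which is what the question is asking about):

```python
def build_fp_tree(transactions):
    """Insert each transaction as a counted path in a prefix tree."""
    # A node is {"count": int, "children": {item: node}}.
    root = {"count": 0, "children": {}}
    for txn in transactions:
        node = root
        for item in txn:  # assumes items are already in a fixed order
            child = node["children"].setdefault(
                item, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return root

tree = build_fp_tree([["a", "b"], ["a", "c"], ["a", "b"]])
```

A real FP-tree additionally keeps a header table linking equal items across branches; this sketch only shows the counted-path idea one would visualize.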
1
vote
0 answers

Cluster Center in Spark Streaming k-means Clustering

I am using Streaming k-means to cluster some 2-dimensional stream data using the example in http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means. code: model = StreamingKMeans(k=5, decayFactor=0.7).setRandomCenters(2, 1.0,…
Saeed
  • 357
  • 4
  • 11
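The decayFactor in the question controls the "forgetful" centre update described in the streaming k-means docs: each batch, an old centre with weight n is blended with the mean of the m points assigned to it, with the old weight discounted by the decay factor a. A pure-Python sketch of that update rule for a single centre (variable names are mine):

```python
def update_center(center, weight, batch_mean, batch_count, decay):
    """One forgetful streaming k-means update for a single centre."""
    new_weight = weight * decay + batch_count
    new_center = [
        (c * weight * decay + x * batch_count) / new_weight
        for c, x in zip(center, batch_mean)
    ]
    return new_center, new_weight

# decay=0.0 forgets the past entirely: the centre jumps to the batch mean.
c0, w0 = update_center([0.0, 0.0], 10.0, [1.0, 2.0], 5.0, 0.0)
# decay=1.0 keeps all history: the batch is averaged into the full weight.
c1, w1 = update_center([0.0, 0.0], 10.0, [1.0, 2.0], 5.0, 1.0)
```

With decayFactor=0.7, as in the question, centres track the stream with an exponentially discounted memory of earlier batches.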
1
vote
0 answers

Configure Spark MLlib in Eclipse Scala IDE

I am facing a problem: I could not find any guidance on how to configure Spark MLlib in the Eclipse Scala IDE. Can anyone help me out by explaining how to configure Spark MLlib in the Scala IDE? In addition, can anybody tell me how to implement…
mmr
  • 133
  • 3
  • 11
1
vote
1 answer

Text classification - how to approach

I'll try to describe what I have in mind. There is text content stored in an MS SQL database. Content comes in daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only…
1
vote
1 answer

Spark: Logistic regression

This code works great! val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training) I am able to call model.predict(...) However, when I try to set up the model parameters, I can't call model.predict. For example, with the following…
user3803714
  • 5,269
  • 10
  • 42
  • 61
1
vote
0 answers

Spark Random Forest Model Save Method is Not Working

I recently upgraded Spark from version 1.3 to 1.5. I am using model.save to save the random forest model. My code worked fine in 1.3, but in 1.5 I am getting the following error. ERROR: org.apache.spark.executor.Executor -…
Alchemist
  • 849
  • 2
  • 10
  • 27
1
vote
1 answer

Kryo registration of LabeledPoint class

I am trying to run a very simple Scala class in Spark with Kryo registration. This class just loads data from a file into an RDD[LabeledPoint]. The code (inspired by the one in https://spark.apache.org/docs/latest/mllib-decision-tree.html): import…
Pop
  • 12,135
  • 5
  • 55
  • 68
1
vote
3 answers

Is it possible to obtain class probabilities using GradientBoostedTrees with spark mllib?

I am currently working with Spark MLlib. I have created a text classifier using the gradient boosting algorithm with the class GradientBoostedTrees. Currently I obtain the predictions to know the class of new elements, but I…
Rob
  • 1,080
  • 2
  • 10
  • 24
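A commonly suggested workaround (not an official MLlib API) is to take the weighted sum of the individual tree outputs and squash the resulting margin through a logistic function, since GradientBoostedTrees with log loss optimizes exactly that margin. A hedged pure-Python sketch of the squashing step (the tree outputs and weights are made-up numbers):

```python
import math

def margin_to_probability(tree_outputs, tree_weights):
    """Weighted sum of per-tree outputs -> probability of the positive class."""
    margin = sum(o * w for o, w in zip(tree_outputs, tree_weights))
    # With a log-loss formulation the margin maps to a probability via a
    # scaled sigmoid; the factor 2 matches MLlib's 2 * log(1 + e^(-2yF)) loss.
    return 1.0 / (1.0 + math.exp(-2.0 * margin))

p = margin_to_probability([1.0, -1.0, 1.0], [0.5, 0.3, 0.2])
```

Whether the factor of 2 applies depends on the exact loss used for training, so the calibration of these probabilities should be checked on held-out data.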
1
vote
1 answer

Difference in AUCs between Apache Spark's GBT and sklearn

I tried GBDTs with both Python's sklearn and Spark's local stand-alone MLlib implementation, with default settings, for a binary classification problem. I kept numIterations and the loss function the same in both cases. The features are all real…
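When AUCs from two libraries disagree, it can help to compute the metric independently of both, on the same predictions. A small pure-Python AUC, defined as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half (the labels and scores below are made-up):

```python
def auc(labels, scores):
    """AUC as the rank statistic P(score_pos > score_neg); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

a = auc([1, 1, 0, 0], [0.9, 0.4, 0.4, 0.1])
```

This O(pos x neg) version is only for sanity-checking small samples; the libraries themselves use trapezoidal integration over the ROC curve, which is equivalent up to tie handling.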
1
vote
1 answer

Why is reporting the log perplexity of an LDA model so slow in Spark mllib?

I am fitting an LDA model in Spark MLlib using the OnlineLDAOptimizer. It takes only ~200 seconds to fit 10 topics on 9M documents (tweets). val numTopics=10 val lda = new LDA() .setOptimizer(new…
1
vote
0 answers

CF using MLlib ALS: when should I stop recommending?

I am using the Spark MLlib ALS CF algorithm to build a recommender system for an e-commerce website. I am required by the owner of the website to sort, for each individual user, all 4000 items in the catalog according to that user's likelihood to buy…
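One common heuristic for the "when to stop" part is to cut the ranked list at a score threshold rather than always returning all 4000 items. A minimal pure-Python sketch (the threshold, cap, and (item, score) pairs are hypothetical; in practice the scores would come from the ALS model's predictions):

```python
def recommend(scored_items, threshold, max_items):
    """Keep items scored at or above a threshold, best first, capped."""
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    return [item for item, score in ranked if score >= threshold][:max_items]

recs = recommend([("a", 0.9), ("b", 0.2), ("c", 0.6)],
                 threshold=0.5, max_items=10)
```

The threshold itself is a business choice and is best tuned on held-out interactions, e.g. by picking the cutoff that balances precision against catalog coverage.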