Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark

2241 questions
1
vote
1 answer

How to apply pyspark-mllib-kmeans to categorical variables

There is a huge data file consisting of all categorical columns. I need to dummy-code the data before applying k-means in MLlib. How can this be done in PySpark? Thank you
SparkiTony
  • 11
  • 3
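K-means needs numeric feature vectors, so categorical columns are usually one-hot (dummy) encoded first. A minimal pure-Python sketch of that transformation (the sample rows and helper name are made up; in PySpark this step is typically done with feature transformers before calling MLlib's KMeans):

```python
def one_hot_encode(rows):
    """Map each categorical column to a block of 0/1 indicator features."""
    n_cols = len(rows[0])
    # Collect the sorted distinct categories per column.
    categories = [sorted({r[c] for r in rows}) for c in range(n_cols)]
    encoded = []
    for r in rows:
        vec = []
        for c, value in enumerate(r):
            vec.extend(1.0 if value == cat else 0.0 for cat in categories[c])
        encoded.append(vec)
    return encoded, categories

rows = [("red", "small"), ("blue", "large"), ("red", "large")]
vectors, cats = one_hot_encode(rows)
# Each row becomes a purely numeric vector suitable for k-means distances.
```

The same per-column category lists must be reused when encoding new data, otherwise vector positions shift.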
1
vote
1 answer

Java heap space in spark mllib

I have the following code, which computes some metrics by cross-validation for a random forest classification. def run(data:RDD[LabeledPoint], metric:String = "PR") = { val cv_data:Array[(RDD[LabeledPoint], RDD[LabeledPoint])] =…
Pop
  • 12,135
  • 5
  • 55
  • 68
1
vote
1 answer

LDA in spark: some training documents missing from LDA model. What happened to them?

I build my corpus from a text file; the corpus is a JavaPairRDD of document IDs (created with zipWithIndex()) and counts of how many times each word in the vocabulary appears in each document. I try to count the documents below and I…
maccam912
  • 792
  • 1
  • 7
  • 22
1
vote
1 answer

Converting CoordinateMatrix to Array?

I created a CoordinateMatrix: import org.apache.spark.mllib.linalg.distributed.{ CoordinateMatrix, MatrixEntry} val entries = sc.parallelize(Seq( MatrixEntry(0, 1, 1), MatrixEntry(0, 2, 2), MatrixEntry(0, 3, 3), MatrixEntry(0, 4, 4),…
Xiaoyu Chen
  • 295
  • 1
  • 4
  • 12
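Conceptually, a CoordinateMatrix is just a collection of (row, col, value) entries, and converting it to an array means scattering those values into a zero-filled grid. A pure-Python sketch of that scatter (dimensions are inferred from the entries here; in Spark itself one would collect the entries or go through a local matrix first):

```python
def entries_to_dense(entries):
    """Scatter (row, col, value) triplets into a zero-filled dense matrix."""
    n_rows = max(i for i, _, _ in entries) + 1
    n_cols = max(j for _, j, _ in entries) + 1
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for i, j, v in entries:
        dense[i][j] = v
    return dense

# Mirrors the MatrixEntry(0, 1, 1) ... entries from the question.
entries = [(0, 1, 1.0), (0, 2, 2.0), (0, 3, 3.0), (0, 4, 4.0)]
m = entries_to_dense(entries)
```

Note that unspecified positions default to zero, which matches the sparse-matrix interpretation of the entry list.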
1
vote
0 answers

Can I extract fp-tree (any format) in spark?

FPGrowth finds frequent itemsets in a dataset in Apache Spark. But I really need the fp-tree to visualize my dataset. Is it possible to get the fp-tree that Spark constructs from my dataset?
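For context, the FP-tree that FP-Growth builds internally is a prefix tree of transactions with a count on each node. A small pure-Python sketch of that structure (the node layout is my own; Spark's public FPGrowth API does not expose its internal tree, which is what the question is asking about):

```python
def build_fp_tree(transactions):
    """Insert each transaction as a counted path in a prefix tree."""
    # A node is {"count": int, "children": {item: node}}.
    root = {"count": 0, "children": {}}
    for txn in transactions:
        node = root
        for item in txn:  # assumes items are already in a fixed order
            child = node["children"].setdefault(
                item, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return root

tree = build_fp_tree([["a", "b"], ["a", "c"], ["a", "b"]])
```

A real FP-tree additionally keeps a header table linking equal items across branches; this sketch only shows the counted-path idea one would visualize.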
1
vote
0 answers

Cluster Center in Spark Streaming k-means Clustering

I am using Streaming k-means to cluster some 2-dimensional stream data using the example in http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means. code: model = StreamingKMeans(k=5, decayFactor=0.7).setRandomCenters(2, 1.0,…
Saeed
  • 357
  • 4
  • 11
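The decayFactor in the question controls the "forgetful" centre update described in the streaming k-means docs: each batch, an old centre with weight n is blended with the mean of the m points assigned to it, with the old weight discounted by the decay factor a. A pure-Python sketch of that update rule for a single centre (variable names are mine):

```python
def update_center(center, weight, batch_mean, batch_count, decay):
    """One forgetful streaming k-means update for a single centre."""
    new_weight = weight * decay + batch_count
    new_center = [
        (c * weight * decay + x * batch_count) / new_weight
        for c, x in zip(center, batch_mean)
    ]
    return new_center, new_weight

# decay=0.0 forgets the past entirely: the centre jumps to the batch mean.
c0, w0 = update_center([0.0, 0.0], 10.0, [1.0, 2.0], 5.0, 0.0)
# decay=1.0 keeps all history: the batch is averaged into the full weight.
c1, w1 = update_center([0.0, 0.0], 10.0, [1.0, 2.0], 5.0, 1.0)
```

With decayFactor=0.7, as in the question, centres track the stream with an exponentially discounted memory of earlier batches.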
1
vote
0 answers

Configure Spark MLlib in Eclipse Scala IDE

I am facing a problem: I could not find any guidance on how to configure Spark MLlib in the Eclipse Scala IDE. Can anyone help me out by explaining how to configure Spark MLlib in the Scala IDE? In addition, can anybody tell me how to implement…
mmr
  • 133
  • 3
  • 11
1
vote
1 answer

Text classification - how to approach

I'll try to describe what I have in mind. There is text content stored in an MS SQL database. Content comes in daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only…
1
vote
1 answer

Spark: Logistic regression

This code works great! val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training) I am able to call model.predict(...) However, when I try to set up the model parameters, I can't call model.predict. For example, with the following…
user3803714
  • 5,269
  • 10
  • 42
  • 61
1
vote
0 answers

Spark Random Forest Model Save Method is Not Working

I recently upgraded Spark from version 1.3 to 1.5. I am using model.save to save the random forest model. My code worked fine in 1.3, but in 1.5 I am getting the following error. ERROR: org.apache.spark.executor.Executor -…
Alchemist
  • 849
  • 2
  • 10
  • 27
1
vote
1 answer

Kryo registration of LabeledPoint class

I am trying to run a very simple Scala class in Spark with Kryo registration. This class just loads data from a file into an RDD[LabeledPoint]. The code (inspired by the one in https://spark.apache.org/docs/latest/mllib-decision-tree.html): import…
Pop
  • 12,135
  • 5
  • 55
  • 68
1
vote
3 answers

Is it possible to obtain class probabilities using GradientBoostedTrees with spark mllib?

I am currently working with Spark MLlib. I have created a text classifier using the gradient boosting algorithm with the class GradientBoostedTrees. Currently I obtain the predictions to know the class of new elements, but I…
Rob
  • 1,080
  • 2
  • 10
  • 24
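A commonly suggested workaround (not an official MLlib API) is to take the weighted sum of the individual tree outputs and squash the resulting margin through a logistic function, since GradientBoostedTrees with log loss optimizes exactly that margin. A hedged pure-Python sketch of the squashing step (the tree outputs and weights are made-up numbers):

```python
import math

def margin_to_probability(tree_outputs, tree_weights):
    """Weighted sum of per-tree outputs -> probability of the positive class."""
    margin = sum(o * w for o, w in zip(tree_outputs, tree_weights))
    # With a log-loss formulation the margin maps to a probability via a
    # scaled sigmoid; the factor 2 matches MLlib's 2 * log(1 + e^(-2yF)) loss.
    return 1.0 / (1.0 + math.exp(-2.0 * margin))

p = margin_to_probability([1.0, -1.0, 1.0], [0.5, 0.3, 0.2])
```

Whether the factor of 2 applies depends on the exact loss used for training, so the calibration of these probabilities should be checked on held-out data.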
1
vote
1 answer

Difference in AUCs between Apache Spark's GBT and sklearn

I tried GBDTs with both Python's sklearn and Spark's local stand-alone MLlib implementation, with default settings, for a binary classification problem. I kept numIterations and the loss function the same in both cases. The features are all real…
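When AUCs from two libraries disagree, it can help to compute the metric independently of both, on the same predictions. A small pure-Python AUC, defined as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half (the labels and scores below are made-up):

```python
def auc(labels, scores):
    """AUC as the rank statistic P(score_pos > score_neg); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

a = auc([1, 1, 0, 0], [0.9, 0.4, 0.4, 0.1])
```

This O(pos x neg) version is only for sanity-checking small samples; the libraries themselves use trapezoidal integration over the ROC curve, which is equivalent up to tie handling.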
1
vote
1 answer

Why is reporting the log perplexity of an LDA model so slow in Spark mllib?

I am fitting an LDA model in Spark MLlib using the OnlineLDAOptimizer. It takes only ~200 seconds to fit 10 topics on 9M documents (tweets). val numTopics=10 val lda = new LDA() .setOptimizer(new…
1
vote
0 answers

CF using MLlib ALS: when should I stop recommending?

I am using the Spark MLlib ALS CF algorithm to build a recommender system for an e-commerce website. I am required by the owner of the website to sort, for each individual user, all 4000 items in the catalog according to that user's likelihood to buy…
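One common heuristic for the "when to stop" part is to cut the ranked list at a score threshold rather than always returning all 4000 items. A minimal pure-Python sketch (the threshold, cap, and (item, score) pairs are hypothetical; in practice the scores would come from the ALS model's predictions):

```python
def recommend(scored_items, threshold, max_items):
    """Keep items scored at or above a threshold, best first, capped."""
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    return [item for item, score in ranked if score >= threshold][:max_items]

recs = recommend([("a", 0.9), ("b", 0.2), ("c", 0.6)],
                 threshold=0.5, max_items=10)
```

The threshold itself is a business choice and is best tuned on held-out interactions, e.g. by picking the cutoff that balances precision against catalog coverage.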