Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark.

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1
vote
1 answer

How to store the text file on the Master?

I am using a standalone cluster to run the ALS algorithm. The predictions are stored to a text file using saveAsTextFile(path), but the text file ends up on the cluster nodes. I want to store the text file on the Master.
Shishir Anshuman
  • 1,115
  • 7
  • 23
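
A minimal Scala sketch of one workaround, assuming the driver runs on the master machine and the predictions fit in driver memory (the RDD name and output path are hypothetical): saveAsTextFile writes partition files on whichever nodes execute the tasks, so collecting to the driver and writing with plain I/O keeps the file on the master.

    import java.io.PrintWriter

    // `predictions` is the hypothetical RDD of ALS predictions; collect()
    // pulls everything to the driver, so this only suits small outputs.
    val local = predictions.collect()
    val out = new PrintWriter("/home/user/predictions.txt")  // hypothetical path
    local.foreach(r => out.println(r))
    out.close()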
1
vote
1 answer

Gaussian Mixture Model in scala spark 1.5.1 weights are always uniformly distributed

I implemented the default GMM model provided in MLlib for my algorithm. I am repeatedly finding that the resultant weights are always equally weighted no matter how many clusters I initialize. Is there any specific reason why the weights are not being…
Leothorn
  • 1,345
  • 1
  • 23
  • 45
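
A minimal Scala sketch of fitting MLlib's GaussianMixture and inspecting the learned weights, assuming `data` is a hypothetical RDD[Vector] of feature rows; varying the seed and iteration count is one way to check whether initialization is the culprit.

    import org.apache.spark.mllib.clustering.GaussianMixture

    // `data` is a hypothetical RDD[Vector] of feature rows.
    val gmm = new GaussianMixture()
      .setK(4)                // number of clusters
      .setMaxIterations(200)  // more iterations than the default
      .setSeed(42L)           // vary the seed to probe initialization effects
      .run(data)

    gmm.weights.zipWithIndex.foreach { case (w, i) =>
      println(s"cluster $i weight = $w")
    }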
1
vote
1 answer

Issue when writing to file in spark

I'm working with Spark in local mode with the following options: spark-shell --driver-memory 21G --executor-memory 10G --num-executors 4 --driver-java-options "-Dspark.executor.memory=10G" --executor-cores 8. It is a four-node cluster with 32G of RAM…
tourist
  • 4,165
  • 6
  • 25
  • 47
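
A hedged aside: in local mode there are no separate executor processes, so flags like --num-executors and --executor-memory (which target cluster managers such as YARN) are effectively ignored and the driver JVM does all the work. A minimal local-mode invocation under that assumption, carrying over the sizes from the question:

    spark-shell --master local[8] --driver-memory 21G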
1
vote
0 answers

MLLib Classification deployment over http

I want to deploy a classifier I trained using MLlib behind an HTTP service. So I am wondering: if I load the serialized object in my code and send it some data, is it necessary to run a local version of Spark as well? And if so, is there any effect…
ilijaluve
  • 1,050
  • 2
  • 10
  • 24
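
Models saved with MLlib's model.save(sc, path) are loaded back via the companion object's load(sc, path), which does need a SparkContext; a lightweight local[*] context inside the web service is enough, and single-row scoring then happens on the driver. A sketch assuming a logistic-regression classifier and a hypothetical model path:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vectors

    // A small local context is enough to load and serve the model.
    val sc = new SparkContext(
      new SparkConf().setAppName("model-serving").setMaster("local[*]"))
    val model = LogisticRegressionModel.load(sc, "/models/classifier")  // hypothetical path

    // Scoring a single request runs on the driver, no cluster needed.
    val prediction = model.predict(Vectors.dense(0.1, 2.3, 4.5))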
1
vote
0 answers

Spark - MLlib Obtain loss (cost/error) history from LogisticRegressionWithLBFGS

I am using Apache Spark to perform logistic regression with LBFGS. I am trying to generate learning curves to see whether my model is suffering from high bias or high variance. Andrew Ng discusses the usefulness of learning curves in his lecture on…
Brian
  • 7,098
  • 15
  • 56
  • 73
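
LogisticRegressionWithLBFGS does not expose its loss history, but the underlying optimizer, LBFGS.runLBFGS, returns the per-iteration losses alongside the weights. A Scala sketch under that assumption (`training` is a hypothetical RDD[LabeledPoint]):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
    import org.apache.spark.mllib.util.MLUtils

    // runLBFGS expects (label, features) pairs, with a bias term appended.
    val data = training.map(lp => (lp.label, MLUtils.appendBias(lp.features))).cache()

    val numFeatures = training.first().features.size
    val initialWeights = Vectors.dense(new Array[Double](numFeatures + 1))

    val (weights, lossHistory) = LBFGS.runLBFGS(
      data,
      new LogisticGradient(),
      new SquaredL2Updater(),
      10,    // numCorrections
      1e-6,  // convergenceTol
      100,   // maxNumIterations
      0.1,   // regParam
      initialWeights)

    // lossHistory is the learning-curve raw material: one loss per iteration.
    lossHistory.zipWithIndex.foreach { case (loss, i) => println(s"iter $i: $loss") }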
1
vote
2 answers

Converting a [(Int, Seq[Double])] RDD to LabeledPoint

I have an RDD in the following format and would like to convert it into a LabeledPoint RDD in order to process it in MLlib: Test: RDD[(Int, Seq[Double])] = Array((1,List(1.0,3.0,8.0),(2,List(3.0,…
ulrich
  • 3,547
  • 5
  • 35
  • 49
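
A minimal Scala sketch of the conversion, treating the Int as the label and the Seq[Double] as the feature vector:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // test: RDD[(Int, Seq[Double])], as in the question
    val labeled = test.map { case (label, features) =>
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }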
1
vote
0 answers

java.lang.OutOfMemoryError when saving a model on the disk using Spark-mllib

I am trying to run LDA on a very small dataset of ~1000 documents. The LDA works fine and I am also able to save the model. If I run the program without lDAModel.save(), I get the following at the end: 16/03/13 14:26:52 INFO SparkUI: Stopped Spark…
Animesh Pandey
  • 5,900
  • 13
  • 64
  • 130
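
A sketch of the save/load pair, with a hypothetical path; an OutOfMemoryError at save time is often driver-side, so raising --driver-memory at launch is a common first thing to try (an assumption about the failure mode, not a diagnosis).

    import org.apache.spark.mllib.clustering.DistributedLDAModel

    // The path is hypothetical; save persists the model for a later load.
    ldaModel.save(sc, "/models/lda")
    val restored = DistributedLDAModel.load(sc, "/models/lda")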
1
vote
1 answer

How to update MLLIB version in PySpark

I have installed the Cloudera VM, and hence it has PySpark with the MLlib library, but the bundled MLlib is too old. I just want to upgrade it to the latest version of MLlib. I have already updated Python from 2.6 to 2.7, but I am unable to find any documentation…
1
vote
1 answer

Multiclass classification with Gradient Boosting Trees in Spark: only supporting binary classification

I am trying to run multi-class classification using Gradient Boosted Trees in Spark MLlib, but it gives the error "only binary classification is supported". The dependent variable has 8 levels. The data has 276 columns and 7000…
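
MLlib's GradientBoostedTrees is indeed binary-only; of the MLlib tree ensembles, RandomForest is the one that supports multiclass, so a common workaround is to swap it in (a different algorithm, named plainly). A Scala sketch assuming `training` is an RDD[LabeledPoint] with labels 0.0 through 7.0:

    import org.apache.spark.mllib.tree.RandomForest

    val model = RandomForest.trainClassifier(
      training,
      numClasses = 8,
      categoricalFeaturesInfo = Map[Int, Int](),  // all features continuous
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 8,
      maxBins = 32,
      seed = 42)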
1
vote
0 answers

Spark submit - Input_raw python

I want to test my model with input values. So, my script is: #!/usr/bin/env python # -*- coding: utf-8 -*- import csv import sys import os from pyspark.mllib.regression import LabeledPoint import numpy as np from pyspark.mllib.evaluation import…
SirGustave
  • 342
  • 1
  • 2
  • 13
1
vote
1 answer

In the Spark UI, what does it mean when a task has a status of GET RESULT?

I have a Spark job which trains a model using Spark ML's logistic regression. In the Spark UI, under the stage details page for a tree-aggregation stage, I see a few tasks with a status of "GET RESULT". What does this status mean? What causes a task…
1
vote
1 answer

Linking the resulting TFIDF sparse vectors to the original documents in Spark

I am calculating the TFIDF using Spark with Python, with the following code: hashingTF = HashingTF() tf = hashingTF.transform(documents) idf = IDF().fit(tf) tfidf = idf.transform(tf) for k in tfidf.collect(): print(k) I…
K.Ali
  • 283
  • 4
  • 15
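
The question uses PySpark, but the idea is language-agnostic: keep the document id paired with the tokens, transform only the values, and the ids survive the pipeline. A Scala sketch with a hypothetical `docs: RDD[(Long, Seq[String])]`:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    // docs: RDD[(Long, Seq[String])] -- id paired with tokenized text.
    val hashingTF = new HashingTF()
    val tf = docs.mapValues(hashingTF.transform(_))  // (id, tf vector)
    tf.cache()
    val idf = new IDF().fit(tf.values)
    val tfidfById = tf.mapValues(idf.transform(_))   // (id, tf-idf vector)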
1
vote
1 answer

Summation of TFIDF sparse vector values for each document in Spark with Python

I calculated the TFIDF for 3 sample text documents using PySpark's HashingTF and IDF, and I got the following SparseVector result: (1048576,[558379],[1.43841036226]) (1048576,[181911,558379,959994], …
K.Ali
  • 283
  • 4
  • 15
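
Each SparseVector stores only its non-zero values, so the per-document sum is just the sum of that values array. A Scala sketch of the same idea (in PySpark, v.values works similarly), assuming `tfidf: RDD[Vector]`:

    import org.apache.spark.mllib.linalg.{SparseVector, Vector}

    val docSums = tfidf.map {
      case sv: SparseVector => sv.values.sum  // sum only the stored entries
      case v: Vector        => v.toArray.sum  // dense fallback
    }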
1
vote
1 answer

Apache Spark MLlib LabeledPoint null label issue

I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database. This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label ( LabeledPoint(double label, Vector features) ) and my…
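
A minimal Scala sketch of one way to cope with nullable labels: drop the null-label rows before constructing LabeledPoints. The row shape is a hypothetical stand-in for whatever the database returns.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // rows: RDD[(java.lang.Double, Array[Double])] -- nullable label column.
    val training = rows.flatMap { case (label, features) =>
      Option(label).map(l => LabeledPoint(l.doubleValue, Vectors.dense(features)))
    }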
1
vote
1 answer

Create a JavaRDD without file in spark

I am totally new to Spark and I want to create a JavaRDD from labeled points programmatically, without reading input from a file. Say I create a few LabeledPoints as follows: LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)); …
user1097675
  • 33
  • 1
  • 2
  • 6
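
The standard way to build an RDD from in-memory objects is parallelize; the question is in Java, where the counterpart is JavaSparkContext#parallelize over a java.util.List. A Scala sketch (the second point's values are made up for illustration):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
    val neg = LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 1.0))

    // parallelize turns an in-memory collection into an RDD, no file needed.
    val rdd = sc.parallelize(Seq(pos, neg))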