Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
11
votes
2 answers

Spark CrossValidatorModel access other models than the bestModel?

I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the…
11
votes
2 answers

Understanding Spark RandomForest featureImportances results

I'm using RandomForest.featureImportances but I don't understand the output. I have 12 features, and this is the output I get. I realize this might not be an Apache Spark-specific question, but I cannot find anything that explains the…
11
votes
2 answers

Apply OneHotEncoder to several categorical columns in Spark MLlib

I have several categorical features and would like to transform them all using OneHotEncoder. However, when I tried to apply the StringIndexer, I got an error: stringIndexer = StringIndexer( inputCol = ['a', 'b','c','d'], outputCol =…
MYjx
  • 4,157
  • 9
  • 38
  • 53
11
votes
1 answer

Attach metadata to vector column in Spark

Context: I have a DataFrame with two columns, label and features: org.apache.spark.sql.DataFrame = [label: int, features: vector], where features is a mllib.linalg.VectorUDT of numeric type built using VectorAssembler. Question: Is there a way to…
gstvolvr
  • 650
  • 1
  • 8
  • 17
11
votes
0 answers

ERROR TaskSchedulerImpl: Exception in statusUpdate

I ran Python code on Spark using MLlib. It works fine with small datasets, but with large datasets I get the following error after two iterations: ERROR TaskSchedulerImpl: Exception in…
Nooshin
  • 943
  • 1
  • 9
  • 24
11
votes
1 answer

PCA Analysis in PySpark

I am looking at http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html. The examples seem to cover only Java and Scala. Does Spark MLlib support PCA for Python? If so, please point me to an example. If not, how to combine…
lapolonio
  • 1,107
  • 2
  • 14
  • 24
11
votes
1 answer

Apache Spark: How to create a matrix from a DataFrame?

I have a DataFrame in Apache Spark with an array of integers; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from an RDD? > imagerdd =…
NormallySane
  • 165
  • 1
  • 2
  • 7
10
votes
1 answer

Perform PCA on each group of a groupBy in PySpark

I am looking for a way to run the spark.ml.feature.PCA function over grouped data returned from a groupBy() call on a DataFrame, but I'm not sure whether this is possible or how to achieve it. This is a basic example that hopefully illustrates what I…
Tim B
  • 3,033
  • 1
  • 23
  • 28
10
votes
3 answers

pyspark randomForest feature importance: how to get column names from the column numbers

I am using the standard (string indexer + one hot encoder + randomForest) pipeline in Spark, as shown below: labelIndexer = StringIndexer(inputCol = class_label_name, outputCol="indexedLabel").fit(data) string_feature_indexers = [ …
Abhishek
  • 3,337
  • 4
  • 32
  • 51
10
votes
1 answer

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

I am using Spark ML to run some ML experiments. On a small 20MB dataset (the Poker dataset), a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. The same experiment with scikit-learn takes much less time. In terms of…
Larissa Leite
  • 1,358
  • 3
  • 21
  • 36
10
votes
1 answer

Spark.ml regressions do not calculate same models as scikit-learn

I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the…
Frank
  • 4,341
  • 8
  • 41
  • 57
10
votes
1 answer

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what is implemented, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly…
Kobe-Wan Kenobi
  • 3,694
  • 2
  • 40
  • 67
10
votes
2 answers

SPARK, ML, Tuning, CrossValidator: access the metrics

In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline: val cv = new CrossValidator() .setEstimator(pipeline) .setEstimatorParamMaps(paramGrid) …
Rami
  • 8,044
  • 18
  • 66
  • 108
10
votes
3 answers

Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am relatively new to Spark and Scala. I am starting with the following dataframe (single column made out of a dense Vector of Doubles): scala> val scaledDataOnly_pruned = scaledDataOnly.select("features") scaledDataOnly_pruned:…
Yeye
  • 171
  • 1
  • 1
  • 8
10
votes
2 answers

Spark Multiclass Classification Example

Does anyone know where I can find examples of multiclass classification in Spark? I have spent a lot of time searching in books and on the web, and so far I only know that it is possible as of the latest version, according to the documentation.