Highest Voted 'apache-spark-mllib' Questions

11

votes

2 answers

Spark CrossValidatorModel access other models than the bestModel?

I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After training, I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the…

apache-spark apache-spark-mllib cross-validation apache-spark-1.6

asked Aug 10 '16 at 13:14

MeiSign

1,487
1
15
39

11

votes

2 answers

Understanding Spark RandomForest featureImportances results

I'm using RandomForest.featureImportances but I don't understand the output. I have 12 features, and this is the output I get. I realize this might not be an Apache Spark-specific question, but I cannot find anywhere that explains the…

apache-spark classification random-forest apache-spark-mllib

asked Jun 17 '16 at 09:54

other15

839
2
11
23

11

votes

2 answers

Apply OneHotEncoder to several categorical columns in Spark MLlib

I have several categorical features and would like to transform them all using OneHotEncoder. However, when I tried to apply the StringIndexer, I got an error: stringIndexer = StringIndexer( inputCol = ['a', 'b','c','d'], outputCol =…

python apache-spark pyspark apache-spark-mllib apache-spark-ml

asked Mar 04 '16 at 19:42

MYjx

4,157
9
38
53

11

votes

1 answer

Attach metadata to vector column in Spark

Context: I have a data frame with two columns: label, and features. org.apache.spark.sql.DataFrame = [label: int, features: vector] Where features is a mllib.linalg.VectorUDT of numeric type built using VectorAssembler. Question: Is there a way to…

scala apache-spark apache-spark-mllib apache-spark-ml

asked Feb 10 '16 at 01:07

gstvolvr

650
1
8
17

11

votes

0 answers

ERROR TaskSchedulerImpl: Exception in statusUpdate

I ran Python code on Spark using MLlib. It works fine with small datasets, but with large datasets I get the following error after two iterations: ERROR TaskSchedulerImpl: Exception in…

apache-spark apache-spark-mllib

asked Sep 08 '15 at 13:31

Nooshin

943
1
9
24

11

votes

1 answer

PCA Analysis in PySpark

I am looking at http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html and the examples seem to cover only Java and Scala. Does Spark MLlib support PCA in Python? If so, please point me to an example. If not, how to combine…

python apache-spark apache-spark-mllib pca apache-spark-ml

asked Aug 02 '15 at 17:01

lapolonio

1,107
2
14
24

11

votes

1 answer

Apache Spark: How to create a matrix from a DataFrame?

I have a DataFrame in Apache Spark with an array of integers; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from an RDD? > imagerdd =…

python matrix apache-spark pyspark apache-spark-mllib

asked Jul 22 '15 at 15:47

NormallySane

165
1
2
7

10

votes

1 answer

Perform PCA on each group of a groupBy in PySpark

I am looking for a way to run the spark.ml.feature.PCA function over grouped data returned from a groupBy() call on a dataframe. But I'm not sure if this is possible, or how to achieve it. This is a basic example that hopefully illustrates what I…

python machine-learning pyspark pca apache-spark-mllib

asked Jul 21 '17 at 14:44

Tim B

3,033
1
23
28

10

votes

3 answers

pyspark randomForest feature importance: how to get column names from the column numbers

I am using the standard (string indexer + one hot encoder + randomForest) pipeline in Spark, as shown below: labelIndexer = StringIndexer(inputCol = class_label_name, outputCol="indexedLabel").fit(data) string_feature_indexers = [ …

pyspark apache-spark-mllib random-forest apache-spark-ml

asked Jul 11 '17 at 02:01

Abhishek

3,337
4
32
51

10

votes

1 answer

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

I am using Spark ML to run some ML experiments. On a small 20MB dataset (the Poker dataset), a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. With scikit-learn, the same experiment takes much less time. In terms of…

apache-spark pyspark apache-spark-mllib apache-spark-ml

asked Jul 02 '17 at 00:27

Larissa Leite

1,358
3
21
36

10

votes

1 answer

Spark.ml regressions do not calculate same models as scikit-learn

I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the…

apache-spark scikit-learn apache-spark-mllib

asked Mar 10 '17 at 23:28

Frank

4,341
8
41
57

10

votes

1 answer

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what is implemented, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly…

apache-spark apache-spark-mllib apache-spark-ml

asked Oct 26 '16 at 12:38

Kobe-Wan Kenobi

3,694
2
40
67

10

votes

2 answers

SPARK, ML, Tuning, CrossValidator: access the metrics

In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline: val cv = new CrossValidator() .setEstimator(pipeline) .setEstimatorParamMaps(paramGrid) …

apache-spark apache-spark-mllib apache-spark-ml

asked Jan 08 '16 at 13:59

Rami

8,044
18
66
108

10

votes

3 answers

Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am relatively new to Spark and Scala. I am starting with the following dataframe (single column made out of a dense Vector of Doubles): scala> val scaledDataOnly_pruned = scaledDataOnly.select("features") scaledDataOnly_pruned:…

scala apache-spark rdd apache-spark-sql apache-spark-mllib

asked Oct 09 '15 at 22:43

Yeye

171
1
1
8

10

votes

2 answers

Spark Multiclass Classification Example

Do you know where I can find examples of multiclass classification in Spark? I have spent a lot of time searching in books and on the web, and so far I only know that it is possible since the latest version, according to the documentation.

scala apache-spark apache-spark-mllib random-forest apache-spark-ml

asked Aug 15 '15 at 21:02

deniswsrosa

2,421
1
17
25
