Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
12 votes · 3 answers

How to use XGBoost in a PySpark Pipeline

I want to update my PySpark code. In PySpark, the base model must be put into a Pipeline; the official Pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can…
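A minimal sketch, assuming the distributed estimator shipped with xgboost >= 1.7 (xgboost.spark.SparkXGBClassifier) is available; it implements the Spark ML Estimator interface, so it drops into a Pipeline like any other stage (column names here are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # assumes xgboost >= 1.7 with PySpark support

# Assemble raw numeric columns into a single features vector
assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")

# SparkXGBClassifier is an Estimator, so it can be a Pipeline stage
xgb = SparkXGBClassifier(features_col="features", label_col="label", num_workers=2)

pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(train_df)          # train_df is a hypothetical training DataFrame
predictions = model.transform(test_df)  # test_df is likewise hypothetical
```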
12 votes · 3 answers

How to overwrite Spark ML model in PySpark?

from pyspark.ml.regression import RandomForestRegressor rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42) rf_model = rf.fit(train_df) rf_model_path = "./hdfsData/" +…
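If the failure is that the save path already exists, the MLWriter returned by write() can be told to overwrite; a minimal sketch (the path is illustrative):

```python
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(labelCol="label", featuresCol="features",
                           numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)  # train_df as in the question

# rf_model.save(path) fails if the path exists; overwrite() replaces it instead
rf_model.write().overwrite().save("./hdfsData/rf_model")
```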
12 votes · 2 answers

Spark ML - MulticlassClassificationEvaluator - can we get precision/recall by each class label?

I am doing a multiclass prediction with random forest in Spark ML. For MulticlassClassificationEvaluator() in Spark ML, is it possible to get precision/recall for each class label? Currently, I am only seeing precision/recall combined for all…
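A sketch assuming Spark 3.0+, where MulticlassClassificationEvaluator exposes per-label metrics via metricName and metricLabel; on older versions the mllib MulticlassMetrics class provides the same numbers:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Precision for class label 1.0 (Spark 3.0+); use "recallByLabel" for recall
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="precisionByLabel", metricLabel=1.0)

precision_class_1 = evaluator.evaluate(predictions)  # predictions as in the question
```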
11 votes · 2 answers

IllegalArgumentException: Column must be of type struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt; but was actually double

I have a dataframe with multiple categorical columns. I'm trying to find the chi-squared statistics using the built-in function between two columns: from pyspark.ml.stat import ChiSquareTest r = ChiSquareTest.test(df, 'feature1',…
Pratham Solanki
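The error usually means the features argument has to be a vector column rather than a raw double column; a sketch that assembles the column into a vector first (column names follow the question):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import ChiSquareTest

# ChiSquareTest.test expects the features column to be of VectorUDT type
assembler = VectorAssembler(inputCols=["feature1"], outputCol="feature1_vec")
df_vec = assembler.transform(df)

r = ChiSquareTest.test(df_vec, "feature1_vec", "feature2").head()
print(r.pValues)
```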
11 votes · 1 answer

Save and load two ML models in pyspark

First I fit two ML models and save them to two separate files. Note that both models are based on the same dataframe. feature_1 and feature_2 are different sets of features extracted from the same dataset. import sys from…
PaulMag
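A minimal sketch of saving two fitted models to separate paths and loading them back with the matching Model classes (LogisticRegression and the paths are purely illustrative):

```python
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

model_1 = LogisticRegression(featuresCol="feature_1", labelCol="label").fit(df)
model_2 = LogisticRegression(featuresCol="feature_2", labelCol="label").fit(df)

# Each model gets its own directory
model_1.write().overwrite().save("models/model_1")
model_2.write().overwrite().save("models/model_2")

# Load later with the corresponding *Model* class, not the estimator
loaded_1 = LogisticRegressionModel.load("models/model_1")
loaded_2 = LogisticRegressionModel.load("models/model_2")
```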
11 votes · 1 answer

ALS model - predicted full_u * v^t * v ratings are very high

I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v? ! rm -rf ml-1m.zip ml-1m ! wget --quiet…
Chris Snow
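For context, a hedged NumPy sketch of the least-squares fold-in the linked question discusses: projecting a new user's full rating vector onto the item factors needs the (VᵀV)⁻¹ term, and leaving it out (using full_u · V · Vᵀ directly) inflates the predicted ratings. The variable names are hypothetical:

```python
import numpy as np

# V: item-factor matrix (num_items x rank), e.g. collected from the ALS model's item factors
# full_u: the new user's full rating row vector, shape (num_items,)
V = np.asarray(item_factors)           # hypothetical
full_u = np.asarray(new_user_ratings)  # hypothetical

# Least-squares fold-in: solve for the user's latent vector, then project back.
# Omitting the inverse term is what produces unreasonably large predictions.
u_new = full_u @ V @ np.linalg.inv(V.T @ V)  # shape (rank,)
predicted_ratings = u_new @ V.T              # shape (num_items,)
```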
11 votes · 2 answers

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
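A minimal sketch: wrap each vector in a one-element tuple so the DataFrame schema can be inferred (this assumes pyspark.ml.linalg or pyspark.mllib.linalg DenseVectors, both of which have a VectorUDT schema):

```python
# Each DenseVector becomes a single "features" column of VectorUDT type
df = frequencyDenseVectors.map(lambda v: (v,)).toDF(["features"])
df.printSchema()
```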
11 votes · 2 answers

PySpark: How to evaluate AUC of ML recommendation algorithm?

I have a Spark Dataframe as below: predictions.show(5) +------+----+------+-----------+ | user|item|rating| prediction| +------+----+------+-----------+ |379433| 31| 1| 0.08203495| | 1834| 31| 1| 0.4854447| |422635| 31| …
Baktaawar
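A sketch assuming the rating column holds binary 0/1 relevance labels; BinaryClassificationEvaluator accepts the raw double prediction column as the ranking score:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import col

# The label column must be numeric 0/1; prediction is used as the score
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="prediction",
    labelCol="rating",
    metricName="areaUnderROC")

auc = evaluator.evaluate(
    predictions.withColumn("rating", col("rating").cast("double")))
print(auc)
```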
11 votes · 2 answers

Apply OneHotEncoder to several categorical columns in Spark MLlib

I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply StringIndexer, I get an error: stringIndexer = StringIndexer( inputCol = ['a', 'b','c','d'], outputCol =…
MYjx
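StringIndexer takes a single input column, so the usual pattern is one indexer per column chained in a Pipeline; a sketch assuming Spark 3.0+, where OneHotEncoder accepts lists of input and output columns:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

categorical_cols = ["a", "b", "c", "d"]

# One StringIndexer per categorical column
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_cols]

# Spark 3.0+: OneHotEncoder handles multiple columns at once
encoder = OneHotEncoder(
    inputCols=[c + "_idx" for c in categorical_cols],
    outputCols=[c + "_vec" for c in categorical_cols])

pipeline = Pipeline(stages=indexers + [encoder])
encoded_df = pipeline.fit(df).transform(df)
```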
11 votes · 2 answers

How to get probabilities corresponding to the class from Spark ML random forest

I've been using org.apache.spark.ml.Pipeline for machine learning tasks. It is particularly important to know the actual probabilities instead of just a predicted label, and I am having difficulty getting them. Here I am doing a binary…
Qing
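The fitted RandomForestClassificationModel already emits a probability vector column in its transform output; a PySpark sketch for pulling the per-class values out (vector_to_array requires Spark 3.0+, otherwise a UDF is needed):

```python
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

predictions = pipeline_model.transform(test_df)   # pipeline_model / test_df are hypothetical

probs = predictions.select(
    "prediction",
    vector_to_array("probability").alias("probability"))  # array of per-class probabilities
probs.show(truncate=False)
```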
11 votes · 1 answer

Attach metadata to vector column in Spark

Context: I have a data frame with two columns: label and features. org.apache.spark.sql.DataFrame = [label: int, features: vector] Where features is an mllib.linalg.VectorUDT of numeric type built using VectorAssembler. Question: Is there a way to…
gstvolvr
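A hedged PySpark sketch: Column.alias accepts a metadata dict, which is one way to attach (or replace) metadata on the features column; the metadata payload shown here is purely illustrative:

```python
from pyspark.sql.functions import col

meta = {"ml_attr": {"num_attrs": 2}}  # illustrative metadata payload

# Re-alias the column with the metadata attached
df_with_meta = df.withColumn(
    "features", col("features").alias("features", metadata=meta))

print(df_with_meta.schema["features"].metadata)
```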
11 votes · 2 answers

spark.ml StringIndexer throws 'Unseen label' on fit()

I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, ipython notebook. First, it has hardly anything to do with the question Spark, ML, StringIndexer: handling unseen labels. The exception is thrown…
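For reference, a sketch of the handleInvalid parameter that controls this behaviour ("skip" drops rows with unseen labels; "keep" is only available from Spark 2.2 onward):

```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(
    inputCol="category",
    outputCol="category_idx",
    handleInvalid="skip")  # the default, "error", raises the 'Unseen label' exception

indexed = indexer.fit(train_df).transform(test_df)  # train_df / test_df are hypothetical
```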
11 votes · 3 answers

How to create a custom Transformer from a UDF?

I was trying to create and save a Pipeline with custom stages. I need to add a column to my DataFrame by using a UDF. Therefore, I was wondering if it was possible to convert a UDF or a similar action into a Transformer? My custom UDF looks like…
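A minimal sketch of wrapping a UDF in a custom Transformer; the DefaultParamsReadable/Writable mixins (Spark 2.3+) make it persistable inside a Pipeline, and the upper-casing UDF is just a placeholder:

```python
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


class UDFTransformer(Transformer, HasInputCol, HasOutputCol,
                     DefaultParamsReadable, DefaultParamsWritable):
    """Applies a column-level UDF as a Pipeline stage."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        # keyword_only captures the constructor kwargs in self._input_kwargs
        self._set(**self._input_kwargs)

    def _transform(self, df):
        # Placeholder UDF; replace with the real transformation
        upper = udf(lambda s: s.upper() if s is not None else None, StringType())
        return df.withColumn(self.getOutputCol(), upper(df[self.getInputCol()]))
```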
11 votes · 1 answer

PCA Analysis in PySpark

Looking at http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html. The examples seem to only contain Java and Scala. Does Spark MLlib support PCA analysis for Python? If so please point me to an example. If not, how to combine…
lapolonio
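PCA is exposed in the Python DataFrame API via pyspark.ml.feature.PCA; a minimal sketch (input column names are illustrative):

```python
from pyspark.ml.feature import PCA, VectorAssembler

# Assemble numeric columns into a vector, then reduce to k components
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
pca = PCA(k=2, inputCol="features", outputCol="pca_features")

pca_model = pca.fit(assembler.transform(df))
print(pca_model.explainedVariance)
```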
10 votes · 1 answer

StandardScaler in Spark not working as expected

Any idea why Spark would be doing this for StandardScaler? As per the definition of StandardScaler: The StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd will scale the data to unit…
Shrikar
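Worth noting that withMean defaults to False in the ml StandardScaler, so the output is only scaled to unit variance and not centred, which is a common source of "unexpected" results; a sketch enabling both flags:

```python
from pyspark.ml.feature import StandardScaler

# withMean=False (the default) leaves the mean untouched;
# withMean=True requires dense vectors since centring destroys sparsity
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withMean=True, withStd=True)

scaled_df = scaler.fit(df).transform(df)
```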