Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
12 votes · 3 answers

How to use XGBoost in a PySpark Pipeline

I want to update my PySpark code. In PySpark, the base model must be put into a Pipeline; the official Pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can…
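A minimal sketch, assuming the distributed estimator shipped with xgboost >= 1.7 (xgboost.spark.SparkXGBClassifier) is available; it implements the Spark ML Estimator interface, so it drops into a Pipeline like any other stage (column names here are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # assumes xgboost >= 1.7 with PySpark support

# Assemble raw numeric columns into a single features vector
assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")

# SparkXGBClassifier is an Estimator, so it can be a Pipeline stage
xgb = SparkXGBClassifier(features_col="features", label_col="label", num_workers=2)

pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(train_df)          # train_df is a hypothetical training DataFrame
predictions = model.transform(test_df)  # test_df is likewise hypothetical
```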
12 votes · 3 answers

How to overwrite Spark ML model in PySpark?

from pyspark.ml.regression import RandomForestRegressor rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42) rf_model = rf.fit(train_df) rf_model_path = "./hdfsData/" +…
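If the failure is that the save path already exists, the MLWriter returned by write() can be told to overwrite; a minimal sketch (the path is illustrative):

```python
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(labelCol="label", featuresCol="features",
                           numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)  # train_df as in the question

# rf_model.save(path) fails if the path exists; overwrite() replaces it instead
rf_model.write().overwrite().save("./hdfsData/rf_model")
```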
12 votes · 2 answers

Spark ML - MulticlassClassificationEvaluator - can we get precision/recall by each class label?

I am doing a multiclass prediction with random forest in Spark ML. For MulticlassClassificationEvaluator() in Spark ML, is it possible to get precision/recall for each class label? Currently, I am only seeing precision/recall combined for all…
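A sketch assuming Spark 3.0+, where MulticlassClassificationEvaluator exposes per-label metrics via metricName and metricLabel; on older versions the mllib MulticlassMetrics class provides the same numbers:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Precision for class label 1.0 (Spark 3.0+); use "recallByLabel" for recall
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="precisionByLabel", metricLabel=1.0)

precision_class_1 = evaluator.evaluate(predictions)  # predictions as in the question
```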
11 votes · 2 answers

IllegalArgumentException: Column must be of type struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt; but was actually double

I have a dataframe with multiple categorical columns. I'm trying to find the chi-squared statistics using the built-in function between two columns: from pyspark.ml.stat import ChiSquareTest r = ChiSquareTest.test(df, 'feature1',…
Pratham Solanki
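The error usually means the features argument has to be a vector column rather than a raw double column; a sketch that assembles the column into a vector first (column names follow the question):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import ChiSquareTest

# ChiSquareTest.test expects the features column to be of VectorUDT type
assembler = VectorAssembler(inputCols=["feature1"], outputCol="feature1_vec")
df_vec = assembler.transform(df)

r = ChiSquareTest.test(df_vec, "feature1_vec", "feature2").head()
print(r.pValues)
```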
11 votes · 1 answer

Save and load two ML models in pyspark

First I fit two ML models and save them to two separate files. Note that both models are based on the same dataframe. feature_1 and feature_2 are different sets of features extracted from the same dataset. import sys from…
PaulMag
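A minimal sketch of saving two fitted models to separate paths and loading them back with the matching Model classes (LogisticRegression and the paths are purely illustrative):

```python
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

model_1 = LogisticRegression(featuresCol="feature_1", labelCol="label").fit(df)
model_2 = LogisticRegression(featuresCol="feature_2", labelCol="label").fit(df)

# Each model gets its own directory
model_1.write().overwrite().save("models/model_1")
model_2.write().overwrite().save("models/model_2")

# Load later with the corresponding *Model* class, not the estimator
loaded_1 = LogisticRegressionModel.load("models/model_1")
loaded_2 = LogisticRegressionModel.load("models/model_2")
```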
11 votes · 1 answer

ALS model - predicted full_u * v^t * v ratings are very high

I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v? ! rm -rf ml-1m.zip ml-1m ! wget --quiet…
Chris Snow
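For context, a hedged NumPy sketch of the least-squares fold-in the linked question discusses: projecting a new user's full rating vector onto the item factors needs the (VᵀV)⁻¹ term, and leaving it out (using full_u · V · Vᵀ directly) inflates the predicted ratings. The variable names are hypothetical:

```python
import numpy as np

# V: item-factor matrix (num_items x rank), e.g. collected from the ALS model's item factors
# full_u: the new user's full rating row vector, shape (num_items,)
V = np.asarray(item_factors)           # hypothetical
full_u = np.asarray(new_user_ratings)  # hypothetical

# Least-squares fold-in: solve for the user's latent vector, then project back.
# Omitting the inverse term is what produces unreasonably large predictions.
u_new = full_u @ V @ np.linalg.inv(V.T @ V)  # shape (rank,)
predicted_ratings = u_new @ V.T              # shape (num_items,)
```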
11 votes · 2 answers

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
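A minimal sketch: wrap each vector in a one-element tuple so the DataFrame schema can be inferred (this assumes pyspark.ml.linalg or pyspark.mllib.linalg DenseVectors, both of which have a VectorUDT schema):

```python
# Each DenseVector becomes a single "features" column of VectorUDT type
df = frequencyDenseVectors.map(lambda v: (v,)).toDF(["features"])
df.printSchema()
```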
11 votes · 2 answers

PySpark: How to evaluate AUC of ML recommendation algorithm?

I have a Spark Dataframe as below: predictions.show(5) +------+----+------+-----------+ | user|item|rating| prediction| +------+----+------+-----------+ |379433| 31| 1| 0.08203495| | 1834| 31| 1| 0.4854447| |422635| 31| …
Baktaawar
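A sketch assuming the rating column holds binary 0/1 relevance labels; BinaryClassificationEvaluator accepts the raw double prediction column as the ranking score:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import col

# The label column must be numeric 0/1; prediction is used as the score
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="prediction",
    labelCol="rating",
    metricName="areaUnderROC")

auc = evaluator.evaluate(
    predictions.withColumn("rating", col("rating").cast("double")))
print(auc)
```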
11 votes · 2 answers

Apply OneHotEncoder to several categorical columns in Spark MLlib

I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply StringIndexer, I get an error: stringIndexer = StringIndexer( inputCol = ['a', 'b','c','d'], outputCol =…
MYjx
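StringIndexer takes a single input column, so the usual pattern is one indexer per column chained in a Pipeline; a sketch assuming Spark 3.0+, where OneHotEncoder accepts lists of input and output columns:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

categorical_cols = ["a", "b", "c", "d"]

# One StringIndexer per categorical column
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_cols]

# Spark 3.0+: OneHotEncoder handles multiple columns at once
encoder = OneHotEncoder(
    inputCols=[c + "_idx" for c in categorical_cols],
    outputCols=[c + "_vec" for c in categorical_cols])

pipeline = Pipeline(stages=indexers + [encoder])
encoded_df = pipeline.fit(df).transform(df)
```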
11 votes · 2 answers

How to get probabilities corresponding to the class from Spark ML random forest

I've been using org.apache.spark.ml.Pipeline for machine learning tasks. It is particularly important to know the actual probabilities instead of just a predicted label, and I am having difficulty getting them. Here I am doing a binary…
Qing
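The fitted RandomForestClassificationModel already emits a probability vector column in its transform output; a PySpark sketch for pulling the per-class values out (vector_to_array requires Spark 3.0+, otherwise a UDF is needed):

```python
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

predictions = pipeline_model.transform(test_df)   # pipeline_model / test_df are hypothetical

probs = predictions.select(
    "prediction",
    vector_to_array("probability").alias("probability"))  # array of per-class probabilities
probs.show(truncate=False)
```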
11 votes · 1 answer

Attach metadata to vector column in Spark

Context: I have a data frame with two columns: label and features. org.apache.spark.sql.DataFrame = [label: int, features: vector] Where features is an mllib.linalg.VectorUDT of numeric type built using VectorAssembler. Question: Is there a way to…
gstvolvr
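A hedged PySpark sketch: Column.alias accepts a metadata dict, which is one way to attach (or replace) metadata on the features column; the metadata payload shown here is purely illustrative:

```python
from pyspark.sql.functions import col

meta = {"ml_attr": {"num_attrs": 2}}  # illustrative metadata payload

# Re-alias the column with the metadata attached
df_with_meta = df.withColumn(
    "features", col("features").alias("features", metadata=meta))

print(df_with_meta.schema["features"].metadata)
```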
11 votes · 2 answers

spark.ml StringIndexer throws 'Unseen label' on fit()

I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, ipython notebook. First, it has hardly anything to do with the question Spark, ML, StringIndexer: handling unseen labels. The exception is thrown…
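For reference, a sketch of the handleInvalid parameter that controls this behaviour ("skip" drops rows with unseen labels; "keep" is only available from Spark 2.2 onward):

```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(
    inputCol="category",
    outputCol="category_idx",
    handleInvalid="skip")  # the default, "error", raises the 'Unseen label' exception

indexed = indexer.fit(train_df).transform(test_df)  # train_df / test_df are hypothetical
```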
11 votes · 3 answers

How to create a custom Transformer from a UDF?

I was trying to create and save a Pipeline with custom stages. I need to add a column to my DataFrame by using a UDF. Therefore, I was wondering if it was possible to convert a UDF or a similar action into a Transformer? My custom UDF looks like…
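A minimal sketch of wrapping a UDF in a custom Transformer; the DefaultParamsReadable/Writable mixins (Spark 2.3+) make it persistable inside a Pipeline, and the upper-casing UDF is just a placeholder:

```python
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


class UDFTransformer(Transformer, HasInputCol, HasOutputCol,
                     DefaultParamsReadable, DefaultParamsWritable):
    """Applies a column-level UDF as a Pipeline stage."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        # keyword_only captures the constructor kwargs in self._input_kwargs
        self._set(**self._input_kwargs)

    def _transform(self, df):
        # Placeholder UDF; replace with the real transformation
        upper = udf(lambda s: s.upper() if s is not None else None, StringType())
        return df.withColumn(self.getOutputCol(), upper(df[self.getInputCol()]))
```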
11 votes · 1 answer

PCA Analysis in PySpark

Looking at http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html. The examples seem to only contain Java and Scala. Does Spark MLlib support PCA analysis for Python? If so please point me to an example. If not, how to combine…
lapolonio
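PCA is exposed in the Python DataFrame API via pyspark.ml.feature.PCA; a minimal sketch (input column names are illustrative):

```python
from pyspark.ml.feature import PCA, VectorAssembler

# Assemble numeric columns into a vector, then reduce to k components
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
pca = PCA(k=2, inputCol="features", outputCol="pca_features")

pca_model = pca.fit(assembler.transform(df))
print(pca_model.explainedVariance)
```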
10 votes · 1 answer

StandardScaler in Spark not working as expected

Any idea why Spark would be doing this for StandardScaler? As per the definition of StandardScaler: The StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd will scale the data to unit…
Shrikar
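Worth noting that withMean defaults to False in the ml StandardScaler, so the output is only scaled to unit variance and not centred, which is a common source of "unexpected" results; a sketch enabling both flags:

```python
from pyspark.ml.feature import StandardScaler

# withMean=False (the default) leaves the mean untouched;
# withMean=True requires dense vectors since centring destroys sparsity
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withMean=True, withStd=True)

scaled_df = scaler.fit(df).transform(df)
```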