Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
9 votes · 1 answer

How to map variable names to features after pipeline

I have modified the OneHotEncoder example to actually train a LogisticRegression. My question is: how do I map the generated weights back to the categorical variables? def oneHotEncoderExample(sqlContext: SQLContext): Unit = { val df =…
lapolonio · 1,107
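The usual approach here can be illustrated without Spark: the assembled features column carries ML attribute metadata that maps each vector slot back to its original (one-hot-expanded) variable name. A minimal sketch with a hand-built metadata dict follows; the names (age, country_US, country_DE) and coefficient values are made up for illustration. In PySpark the real dict is typically reachable via df.schema["features"].metadata["ml_attr"]["attrs"], and the Scala API exposes the same structure through AttributeGroup.

```python
# Hand-built stand-in for the ML attribute metadata Spark attaches to an
# assembled features column (layout mirrors "ml_attr" -> "attrs" groups).
metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}],
            "binary": [
                {"idx": 1, "name": "country_US"},
                {"idx": 2, "name": "country_DE"},
            ],
        }
    }
}

# Stand-in for LogisticRegressionModel coefficients, one per vector slot.
coefficients = [0.31, -1.20, 0.85]

# Flatten all attribute groups into an index -> name mapping ...
idx_to_name = {
    attr["idx"]: attr["name"]
    for group in metadata["ml_attr"]["attrs"].values()
    for attr in group
}

# ... then pair each weight with its original (possibly one-hot) variable.
named_weights = {idx_to_name[i]: w for i, w in enumerate(coefficients)}
print(named_weights)
# {'age': 0.31, 'country_US': -1.2, 'country_DE': 0.85}
```

The same flattening works for any stage that preserves metadata; it breaks down only if an intermediate transformer drops the attributes.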
8 votes · 1 answer

Slowdown with repeated calls to spark dataframe in memory

Say I have 40 continuous (DoubleType) variables that I've bucketed into quartiles using ft_quantile_discretizer. Identifying the quartiles on all of the variables is super fast, as the function supports execution on multiple variables at once.…
hgb1234 · 83
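The slowdown in cases like this usually comes from re-evaluating the un-cached lineage on every action, so caching or checkpointing the bucketized table is the common remedy. The bucketing step itself is cheap; conceptually it reduces to computing split points and binary-searching each value into a bucket, which can be sketched without Spark (the helper names below are made up, and the naive index-based percentiles only approximate what QuantileDiscretizer's approxQuantile derives):

```python
from bisect import bisect_right

def quartile_splits(values):
    """Rough stand-in for QuantileDiscretizer(numBuckets=4): three interior
    split points near the 25th/50th/75th percentiles (naive, not the
    approxQuantile sketch Spark actually uses)."""
    s = sorted(values)
    n = len(s)
    return [s[n // 4], s[n // 2], s[(3 * n) // 4]]

def bucketize(x, splits):
    # Bucket index = number of split points at or below x.
    return bisect_right(splits, x)

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
splits = quartile_splits(values)            # [3.0, 5.0, 7.0]
print([bucketize(v, splits) for v in values])
# [0, 0, 1, 1, 2, 2, 3, 3]
```

Because the per-value work is a single binary search, repeated slow calls point at lineage recomputation rather than the discretization itself.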
8 votes · 2 answers

How to print the decision path / rules used to predict sample of a specific row in PySpark?

How to print the decision path of a specific sample in a Spark DataFrame? Spark Version: '2.3.1'. The code below prints the decision path of the whole model; how can I make it print the decision path of a specific sample? For example, the decision path…
PolarBear10 · 2,065
8 votes · 3 answers

Spark Java IllegalArgumentException at org.apache.xbean.asm5.ClassReader

I'm trying to use Spark 2.3.1 with Java. I followed the examples in the documentation but keep getting a poorly described exception when calling .fit(trainingData). Exception in thread "main" java.lang.IllegalArgumentException at…
8 votes · 2 answers

ALS model - how to generate full_u * v^t * v?

I'm trying to figure out how an ALS model can predict values for new users in between updates by a batch process. In my search, I came across this stackoverflow answer. I've copied the answer below for the reader's convenience: You can…
Chris Snow · 23,813
8 votes · 1 answer

Spark schema from case class with correct nullability

For a custom Estimator's transformSchema method I need to be able to compare the schema of an input data frame to the schema defined in a case class. Usually this could be done as in "Generate a Spark StructType / Schema from a case class" as…
8 votes · 1 answer

Issue with VectorUDT when using Spark ML

I am writing a UDAF to be applied to a Spark data frame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrame and RDD. Inside the UDAF, I have to specify a…
8 votes · 2 answers

Any way to access methods from individual stages in PySpark PipelineModel?

I've created a PipelineModel for doing LDA in Spark 2.0 (via PySpark API): def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'): """ Create a pipeline for running an LDA model on a corpus. This…
Evan Zamir · 8,059
8 votes · 5 answers

Spark Scala: How to convert DataFrame[Vector] to DataFrame[f1: Double, ..., fn: Double]

I just used StandardScaler to normalize my features for a ML application. After selecting the scaled features, I want to convert this back to a dataframe of Doubles, though the length of my vectors is arbitrary. I know how to do it for a specific…
mt88 · 2,855
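The expansion itself is easy to sketch without Spark: discover the vector length from the data, then emit one Double column per slot. The rows and column names below are invented for illustration; in Spark 3+ the equivalent DataFrame-side move is pyspark.ml.functions.vector_to_array followed by a select over the array indices.

```python
# Plain dicts stand in for DataFrame rows; "features" holds the scaled vector.
rows = [
    {"id": 1, "features": [0.1, 0.5, 0.9]},
    {"id": 2, "features": [0.2, 0.6, 1.0]},
]

# Length is discovered from the first row, so arbitrary vector sizes work.
n = len(rows[0]["features"])

expanded = [
    {"id": r["id"], **{f"f{i+1}": float(v) for i, v in enumerate(r["features"])}}
    for r in rows
]
print(expanded[0])
# {'id': 1, 'f1': 0.1, 'f2': 0.5, 'f3': 0.9}
```

The same "read the width once, then generate column expressions" pattern is what the Spark solutions do under the hood.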
8 votes · 1 answer

How to serialize a pyspark Pipeline object?

I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried the Python pickle library as well as PySpark's PickleSerializer, but the dumps() call itself fails. Providing the code snippet while…
8 votes · 1 answer

Feature normalization algorithm in Spark

Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors: {0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0}, {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0}, {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0}, {-0.95, 0.018,…
Alex B · 347
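A frequent point of confusion behind questions like this: Spark's Normalizer rescales each sample (row) to unit p-norm, p=2 by default, whereas StandardScaler uses per-column statistics. The per-row computation can be sketched in plain Python (the helper name is made up):

```python
def p_norm_normalize(vec, p=2.0):
    """Scale one sample to unit p-norm, as pyspark.ml.feature.Normalizer
    does per row (default p=2). Note this is per-sample, not per-column
    like StandardScaler."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    # Spark leaves all-zero vectors unchanged rather than dividing by zero.
    return [x / norm for x in vec] if norm != 0 else list(vec)

print(p_norm_normalize([3.0, 4.0]))   # [0.6, 0.8]
```

Running this over a small test set like the one in the question makes it easy to check whether Normalizer (row-wise) or a scaler (column-wise) is the transformation actually being observed.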
7 votes · 1 answer

Getting the leaf probabilities of a tree model in spark

I'm trying to refactor a trained Spark tree-based model (RandomForest or GBT classifier) in such a way that it can be exported to environments without Spark. The toDebugString method is a good starting point. However, in the case of…
nicola · 24,005
7 votes · 0 answers

Is there an ARIMA model in Spark Scala?

How can we do ARIMA modeling in Spark with Scala? Can we directly import an ARIMA package the way we do for regression or classification? Spark's ML library has nothing like an ARIMA model.
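Spark ML indeed ships no ARIMA; historically the third-party spark-ts package filled that gap. The autoregressive building block is small enough to sketch directly and run per series inside a grouped map. Below is a least-squares AR(1) estimate in plain Python; the function name and the toy series are invented for illustration:

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} (no intercept),
    the AR(1) building block of ARIMA. Small enough to run per time series
    inside e.g. a Spark grouped-map over keyed data."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

# A series that doubles each step has phi exactly 2.
print(fit_ar1([1.0, 2.0, 4.0, 8.0, 16.0]))   # 2.0
```

Full ARIMA adds differencing and a moving-average term, but this per-series shape is why time-series fitting usually lives in user code (or a dedicated library) layered on top of Spark rather than in spark.ml itself.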
7 votes · 1 answer

Reading a custom pyspark transformer

After messing with this for quite a while, in Spark 2.3 I am finally able to get a pure Python custom transformer saved. But I get an error while loading the transformer back. I checked the content of what was saved and found all the relevant…
7 votes · 4 answers

pyspark - Convert sparse vector obtained after one hot encoding into columns

I am using the Apache Spark ML library to handle categorical features using one-hot encoding. After writing the code below I get a vector c_idx_vec as output of the one-hot encoding. I understand how to interpret this output vector but I am unable to…
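OneHotEncoder emits a SparseVector, which is just a (size, indices, values) triple; turning it into per-category columns is mechanical once that is seen. A Spark-free sketch (the helper name is made up; in PySpark the densification step is typically a UDF or, in Spark 3+, vector_to_array):

```python
def sparse_to_dense(size, indices, values):
    """Expand a SparseVector-style (size, indices, values) triple, the
    representation OneHotEncoder emits, into a dense list of doubles."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# e.g. category index 2 out of 4 one-hot slots
print(sparse_to_dense(4, [2], [1.0]))   # [0.0, 0.0, 1.0, 0.0]
```

Each position of the dense list then becomes one output column, named after the category label recovered from the encoder's metadata.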