Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
10 votes, 2 answers

Spark StringIndexer.fit is very slow on large records

I have large data records formatted as the following sample:

// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc|   123|  true|
// |abc|   345|  true|
// |abc|   567|  true|
// |def|   123|  true|
// |def|   345|  true|
//…
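For context, a minimal PySpark sketch of the setup this question describes (the sample rows here are hypothetical, mirroring the excerpt). StringIndexer.fit scans the column, counts the distinct values, and keeps the full label array on the driver, which is the step that slows down on high-cardinality columns:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("stringindexer-sketch").getOrCreate()

df = spark.createDataFrame(
    [("abc", 123, True), ("abc", 345, True), ("abc", 567, True),
     ("def", 123, True), ("def", 345, True)],
    ["cid", "itemId", "bought"],
)

# fit() collects the distinct values (ordered by frequency) into an
# in-driver label array; with many distinct cid values this is the
# slow / memory-heavy part.
indexer = StringIndexer(inputCol="cid", outputCol="cidIndex")
model = indexer.fit(df)
model.transform(df).show()
```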
10 votes, 3 answers

pyspark randomForest feature importance: how to get column names from the column numbers

I am using the standard (string indexer + one hot encoder + randomForest) pipeline in Spark, as shown below:

labelIndexer = StringIndexer(inputCol = class_label_name, outputCol="indexedLabel").fit(data)
string_feature_indexers = [ …
Abhishek
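A hedged sketch of one common way to answer this: read the per-slot metadata that VectorAssembler/OneHotEncoder attach to the assembled features column and map it onto featureImportances. The names `pipeline_model`, `transformed_df`, and the column name "features" are assumptions mirroring the question's setup, not part of the original post:

```
def feature_importance_with_names(pipeline_model, transformed_df, features_col="features"):
    # "ml_attr" metadata groups features as "numeric" / "binary" / "nominal",
    # each entry carrying the slot index and a human-readable name.
    attrs = transformed_df.schema[features_col].metadata["ml_attr"]["attrs"]
    idx_to_name = {f["idx"]: f["name"] for group in attrs.values() for f in group}

    rf_model = pipeline_model.stages[-1]        # assumed: RandomForest is the last stage
    importances = rf_model.featureImportances   # vector indexed like the assembled features

    return sorted(
        ((name, float(importances[idx])) for idx, name in idx_to_name.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
```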
10 votes, 1 answer

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

I am using Spark ML to run some ML experiments, and on a small dataset of 20MB (the Poker dataset) with a Random Forest and a parameter grid, it takes 1 hour and 30 minutes to finish. By comparison, scikit-learn takes much, much less. In terms of…
Larissa Leite
10 votes, 1 answer

How to get classification probabilities from PySpark MultilayerPerceptronClassifier?

I'm using Spark 2.0.1 in Python; my dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier and only two labels. My question is: is it possible to get not only the…
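A hedged sketch, assuming a recent Spark release (3.x) where the MLP model is a probabilistic classifier and exposes a probability column; in Spark 2.0.1 only the prediction column is produced. Layer sizes, column names, and the DataFrames `train_df`/`test_df` are illustrative assumptions:

```
from pyspark.ml.classification import MultilayerPerceptronClassifier

mlp = MultilayerPerceptronClassifier(
    layers=[4, 5, 2],              # 4 inputs, one hidden layer, 2 output labels
    labelCol="label",
    featuresCol="features",
    probabilityCol="probability",  # exposed by the model in newer releases
)
model = mlp.fit(train_df)          # train_df: assumed DataFrame with label/features
model.transform(test_df).select("prediction", "probability").show()
```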
10 votes, 1 answer

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries - Spark MLlib and Spark ML. They overlap somewhat in what they implement, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly…
Kobe-Wan Kenobi
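For reference, a minimal sketch of PCA in the DataFrame-based API (pyspark.ml.feature.PCA), which is the recommended entry point; the RDD-based mllib version is in maintenance mode. The toy vectors below are illustrative:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
     (Vectors.dense([0.0, 1.0, 0.0, 7.0, 6.0]),)],
    ["features"],
)

# Project the 5-dimensional vectors onto the top 2 principal components.
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(data)
model.transform(data).select("pcaFeatures").show(truncate=False)
```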
10 votes, 3 answers

Serialization issues in Spark Streaming

I'm quite confused about how Spark works with the data under the hood. For example, when I run a streaming job and apply foreachRDD, the behaviour depends on whether a variable is captured from the outer scope or initialised inside. val sparkConf =…
lizarisk
10 votes, 2 answers

SPARK, ML, Tuning, CrossValidator: access the metrics

In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  …
Rami
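A hedged PySpark sketch of the equivalent: in recent releases the fitted CrossValidatorModel keeps one averaged metric per parameter map, which can be zipped with the grid. The objects `cv` (a configured CrossValidator) and `train_df` are assumptions:

```
cv_model = cv.fit(train_df)
param_maps = cv.getEstimatorParamMaps()

# One averaged cross-validation metric per entry of the parameter grid.
for params, metric in zip(param_maps, cv_model.avgMetrics):
    readable = {p.name: v for p, v in params.items()}
    print(readable, "->", metric)

print("best model:", cv_model.bestModel)
```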
10 votes, 1 answer

Using Spark ML's OneHotEncoder on multiple columns

I've been able to create a pipeline that allows me to index multiple string columns at once, but I am getting stuck encoding them because, unlike indexing, the encoder is not an estimator, so I never call fit according to the OneHotEncoder…
Michael Discenza
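A hedged sketch, assuming Spark 3.x, where OneHotEncoder is an estimator that accepts lists of input and output columns; on older releases the usual pattern is one indexer/encoder pair per column. The column names and the input DataFrame `df` are hypothetical:

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

cat_cols = ["color", "country"]   # hypothetical categorical columns

indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    for c in cat_cols
]
encoder = OneHotEncoder(
    inputCols=[c + "_idx" for c in cat_cols],
    outputCols=[c + "_vec" for c in cat_cols],
)

pipeline = Pipeline(stages=indexers + [encoder])
encoded_df = pipeline.fit(df).transform(df)   # df: assumed input DataFrame
```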
10 votes, 2 answers

Spark Multiclass Classification Example

Do you guys know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I just know that it is possible since the latest version, according to the documentation.
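As a pointer, a minimal self-contained multiclass sketch (three labels, toy data, assuming Spark 2.1+ where the multinomial family is available); most spark.ml classifiers, including RandomForestClassifier, handle more than two classes out of the box:

```
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (2.0, Vectors.dense([2.0, -1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([2.1, 1.1])),
     (2.0, Vectors.dense([1.9, -1.2]))],
    ["label", "features"],
)

# Multinomial logistic regression fits one coefficient vector per class.
lr = LogisticRegression(maxIter=50, family="multinomial")
model = lr.fit(train)
model.transform(train).select("label", "prediction", "probability").show(truncate=False)
```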
9 votes, 3 answers

Spark v3.0.0 - WARN DAGScheduler: broadcasting large task binary with size xx

I'm new to Spark. I'm coding a machine learning algorithm in Spark standalone (v3.0.0) with these configurations set:

SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory",…
vittoema96
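One point worth noting about the configuration in the excerpt: spark.driver.memory only takes effect if it is set before the driver JVM starts, so setting it via conf.set(...) inside an already running local-mode application generally does nothing. A hedged PySpark sketch of the equivalent setup (values are illustrative, not recommendations):

```
# Driver memory is best supplied at launch time, e.g.:
#   spark-submit --master "local[*]" --driver-memory 4g my_app.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("ml-app")
    # runtime settings such as shuffle partitions can still be set here;
    # the driver heap size cannot, because the JVM is already running
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)
```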
9 votes, 1 answer

In Spark ML, why is fitting a StringIndexer on a column with millions of distinct values yielding an OOM error?

I am trying to use Spark's StringIndexer feature transformer on a column with about 15,000,000 unique string values. Regardless of how many resources I throw at it, Spark always dies on me with some sort of Out Of Memory exception. from…
Interfector
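A hedged alternative sketch (not from the original post): when the label array is too large to hold on the driver, the same string-to-index mapping can be built as a DataFrame and joined back, keeping everything distributed. Note the indices are contiguous but not frequency-ordered as StringIndexer's would be; the toy data is a stand-in:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",), ("c",)], ["value"])

# zipWithIndex assigns a unique, contiguous id per distinct value without
# ever collecting the values to the driver.
mapping = (
    df.select("value").distinct().rdd
    .map(lambda row: row[0])
    .zipWithIndex()
    .toDF(["value", "value_index"])
)

indexed = df.join(mapping, on="value", how="left")
indexed.show()
```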
9 votes, 2 answers

Online learning of LDA model in Spark

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?
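A hedged sketch of the closest built-in option: spark.ml's LDA has an online variational Bayes optimizer, which processes the corpus in mini-batches during a single fit(); to my knowledge it does not expose warm-starting from a previously saved model, so this covers only the "online training" half of the question. `corpus_df` is an assumed DataFrame of term-count vectors:

```
from pyspark.ml.clustering import LDA

lda = LDA(
    k=10,
    maxIter=20,
    optimizer="online",
    subsamplingRate=0.05,      # fraction of the corpus per mini-batch
    featuresCol="features",    # term-count vectors, e.g. from CountVectorizer
)
lda_model = lda.fit(corpus_df)
lda_model.describeTopics(5).show(truncate=False)
```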
9 votes, 4 answers

Relating column names to model parameters in pySpark ML

I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a…
Jeff
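A hedged sketch of the usual approach: after StringIndexer + OneHotEncoder + VectorAssembler, the assembled column carries per-slot metadata that can be zipped with the fitted coefficients. The names `assembled_df`, `glr_model`, and the column "features" are assumptions mirroring the question's setup:

```
# Per-slot names live in the "ml_attr" metadata of the assembled column.
attrs = assembled_df.schema["features"].metadata["ml_attr"]["attrs"]
slot_names = sorted((f["idx"], f["name"]) for group in attrs.values() for f in group)

# Coefficients are indexed like the assembled vector (intercept excluded).
for (idx, name), coef in zip(slot_names, glr_model.coefficients.toArray()):
    print(f"{name}: {coef:.4f}")
```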
9 votes, 1 answer

How to combine n-grams into one vocabulary in Spark?

Wondering if there is a built-in Spark feature to combine 1-, 2-, n-gram features into a single vocabulary. Setting n=2 in NGram followed by invocation of CountVectorizer results in a dictionary containing only 2-grams. What I really want is to…
Evan Zamir
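A hedged sketch of one workaround (assuming Spark 2.4+, where concat() works on array columns): build unigrams and bigrams separately, concatenate the token arrays, and fit a single CountVectorizer over the combined column so both kinds of grams land in one vocabulary:

```
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import Tokenizer, NGram, CountVectorizer

spark = SparkSession.builder.getOrCreate()
docs = spark.createDataFrame([("spark ml is nice",), ("spark is fast",)], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="unigrams").transform(docs)
bigrams = NGram(n=2, inputCol="unigrams", outputCol="bigrams").transform(tokens)

# Merge the two token arrays so one CountVectorizer sees 1- and 2-grams.
combined = bigrams.withColumn("all_grams", F.concat("unigrams", "bigrams"))

cv_model = CountVectorizer(inputCol="all_grams", outputCol="features").fit(combined)
print(cv_model.vocabulary)   # single vocabulary containing 1- and 2-grams
```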
9 votes, 1 answer

Non-linear (DAG) ML pipelines in Apache Spark

I've set up a simple Spark ML app, where I have a pipeline of independent transformers that add columns to a DataFrame of raw data. Since the transformers don't look at the output of one another, I was hoping I could run them in parallel in a…
hillel
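A hedged observation with a small sketch: Pipeline applies its stages sequentially, but stages that are pure Transformers (no fit step) only add lazy column expressions, so chaining independent ones does not by itself force multiple passes over the data. The SQLTransformer stages and the input DataFrame `df` below are illustrative:

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

# Two independent transformers; neither reads the other's output column.
add_len   = SQLTransformer(statement="SELECT *, length(text) AS text_len FROM __THIS__")
add_upper = SQLTransformer(statement="SELECT *, upper(text) AS text_upper FROM __THIS__")

pipeline = Pipeline(stages=[add_len, add_upper])
result = pipeline.fit(df).transform(df)   # df: assumed DataFrame with a `text` column
result.explain()                          # one optimized plan covering both stages
```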