Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
10 votes, 2 answers

Spark StringIndexer.fit is very slow on large records

I have large data records formatted as the following sample:

// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc|   123|  true|
// |abc|   345|  true|
// |abc|   567|  true|
// |def|   123|  true|
// |def|   345|  true|
//…
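For context, a minimal PySpark sketch of the setup this question describes (the sample rows here are hypothetical, mirroring the excerpt). StringIndexer.fit scans the column, counts the distinct values, and keeps the full label array on the driver, which is the step that slows down on high-cardinality columns:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("stringindexer-sketch").getOrCreate()

df = spark.createDataFrame(
    [("abc", 123, True), ("abc", 345, True), ("abc", 567, True),
     ("def", 123, True), ("def", 345, True)],
    ["cid", "itemId", "bought"],
)

# fit() collects the distinct values (ordered by frequency) into an
# in-driver label array; with many distinct cid values this is the
# slow / memory-heavy part.
indexer = StringIndexer(inputCol="cid", outputCol="cidIndex")
model = indexer.fit(df)
model.transform(df).show()
```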
10 votes, 3 answers

pyspark randomForest feature importance: how to get column names from the column numbers

I am using the standard (string indexer + one hot encoder + randomForest) pipeline in Spark, as shown below:

labelIndexer = StringIndexer(inputCol = class_label_name, outputCol="indexedLabel").fit(data)
string_feature_indexers = [ …
Abhishek
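A hedged sketch of one common way to answer this: read the per-slot metadata that VectorAssembler/OneHotEncoder attach to the assembled features column and map it onto featureImportances. The names `pipeline_model`, `transformed_df`, and the column name "features" are assumptions mirroring the question's setup, not part of the original post:

```
def feature_importance_with_names(pipeline_model, transformed_df, features_col="features"):
    # "ml_attr" metadata groups features as "numeric" / "binary" / "nominal",
    # each entry carrying the slot index and a human-readable name.
    attrs = transformed_df.schema[features_col].metadata["ml_attr"]["attrs"]
    idx_to_name = {f["idx"]: f["name"] for group in attrs.values() for f in group}

    rf_model = pipeline_model.stages[-1]        # assumed: RandomForest is the last stage
    importances = rf_model.featureImportances   # vector indexed like the assembled features

    return sorted(
        ((name, float(importances[idx])) for idx, name in idx_to_name.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
```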
10 votes, 1 answer

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

I am using Spark ML to run some ML experiments, and on a small dataset of 20MB (the Poker dataset) with a Random Forest and a parameter grid, it takes 1 hour and 30 minutes to finish. By comparison, scikit-learn takes much, much less. In terms of…
Larissa Leite
10 votes, 1 answer

How to get classification probabilities from PySpark MultilayerPerceptronClassifier?

I'm using Spark 2.0.1 in Python; my dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier and only two labels. My question is: is it possible to get not only the…
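A hedged sketch, assuming a recent Spark release (3.x) where the MLP model is a probabilistic classifier and exposes a probability column; in Spark 2.0.1 only the prediction column is produced. Layer sizes, column names, and the DataFrames `train_df`/`test_df` are illustrative assumptions:

```
from pyspark.ml.classification import MultilayerPerceptronClassifier

mlp = MultilayerPerceptronClassifier(
    layers=[4, 5, 2],              # 4 inputs, one hidden layer, 2 output labels
    labelCol="label",
    featuresCol="features",
    probabilityCol="probability",  # exposed by the model in newer releases
)
model = mlp.fit(train_df)          # train_df: assumed DataFrame with label/features
model.transform(test_df).select("prediction", "probability").show()
```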
10 votes, 1 answer

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries - Spark MLlib and Spark ML. They overlap somewhat in what they implement, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly…
Kobe-Wan Kenobi
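For reference, a minimal sketch of PCA in the DataFrame-based API (pyspark.ml.feature.PCA), which is the recommended entry point; the RDD-based mllib version is in maintenance mode. The toy vectors below are illustrative:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
     (Vectors.dense([0.0, 1.0, 0.0, 7.0, 6.0]),)],
    ["features"],
)

# Project the 5-dimensional vectors onto the top 2 principal components.
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(data)
model.transform(data).select("pcaFeatures").show(truncate=False)
```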
10 votes, 3 answers

Serialization issues in Spark Streaming

I'm quite confused about how Spark works with the data under the hood. For example, when I run a streaming job and apply foreachRDD, the behaviour depends on whether a variable is captured from the outer scope or initialised inside. val sparkConf =…
lizarisk
10 votes, 2 answers

SPARK, ML, Tuning, CrossValidator: access the metrics

In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  …
Rami
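A hedged PySpark sketch of the equivalent: in recent releases the fitted CrossValidatorModel keeps one averaged metric per parameter map, which can be zipped with the grid. The objects `cv` (a configured CrossValidator) and `train_df` are assumptions:

```
cv_model = cv.fit(train_df)
param_maps = cv.getEstimatorParamMaps()

# One averaged cross-validation metric per entry of the parameter grid.
for params, metric in zip(param_maps, cv_model.avgMetrics):
    readable = {p.name: v for p, v in params.items()}
    print(readable, "->", metric)

print("best model:", cv_model.bestModel)
```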
10 votes, 1 answer

Using Spark ML's OneHotEncoder on multiple columns

I've been able to create a pipeline that allows me to index multiple string columns at once, but I am getting stuck encoding them because, unlike indexing, the encoder is not an estimator, so I never call fit according to the OneHotEncoder…
Michael Discenza
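A hedged sketch, assuming Spark 3.x, where OneHotEncoder is an estimator that accepts lists of input and output columns; on older releases the usual pattern is one indexer/encoder pair per column. The column names and the input DataFrame `df` are hypothetical:

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

cat_cols = ["color", "country"]   # hypothetical categorical columns

indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    for c in cat_cols
]
encoder = OneHotEncoder(
    inputCols=[c + "_idx" for c in cat_cols],
    outputCols=[c + "_vec" for c in cat_cols],
)

pipeline = Pipeline(stages=indexers + [encoder])
encoded_df = pipeline.fit(df).transform(df)   # df: assumed input DataFrame
```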
10 votes, 2 answers

Spark Multiclass Classification Example

Do you guys know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I just know that it is possible since the latest version, according to the documentation.
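As a pointer, a minimal self-contained multiclass sketch (three labels, toy data, assuming Spark 2.1+ where the multinomial family is available); most spark.ml classifiers, including RandomForestClassifier, handle more than two classes out of the box:

```
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (2.0, Vectors.dense([2.0, -1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([2.1, 1.1])),
     (2.0, Vectors.dense([1.9, -1.2]))],
    ["label", "features"],
)

# Multinomial logistic regression fits one coefficient vector per class.
lr = LogisticRegression(maxIter=50, family="multinomial")
model = lr.fit(train)
model.transform(train).select("label", "prediction", "probability").show(truncate=False)
```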
9 votes, 3 answers

Spark v3.0.0 - WARN DAGScheduler: broadcasting large task binary with size xx

I'm new to Spark. I'm coding a machine learning algorithm in Spark standalone (v3.0.0) with these configurations set:

SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory",…
vittoema96
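One point worth noting about the configuration in the excerpt: spark.driver.memory only takes effect if it is set before the driver JVM starts, so setting it via conf.set(...) inside an already running local-mode application generally does nothing. A hedged PySpark sketch of the equivalent setup (values are illustrative, not recommendations):

```
# Driver memory is best supplied at launch time, e.g.:
#   spark-submit --master "local[*]" --driver-memory 4g my_app.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("ml-app")
    # runtime settings such as shuffle partitions can still be set here;
    # the driver heap size cannot, because the JVM is already running
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)
```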
9 votes, 1 answer

In Spark ML, why is fitting a StringIndexer on a column with millions of distinct values yielding an OOM error?

I am trying to use Spark's StringIndexer feature transformer on a column with about 15,000,000 unique string values. Regardless of how many resources I throw at it, Spark always dies on me with some sort of Out Of Memory exception. from…
Interfector
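A hedged alternative sketch (not from the original post): when the label array is too large to hold on the driver, the same string-to-index mapping can be built as a DataFrame and joined back, keeping everything distributed. Note the indices are contiguous but not frequency-ordered as StringIndexer's would be; the toy data is a stand-in:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",), ("c",)], ["value"])

# zipWithIndex assigns a unique, contiguous id per distinct value without
# ever collecting the values to the driver.
mapping = (
    df.select("value").distinct().rdd
    .map(lambda row: row[0])
    .zipWithIndex()
    .toDF(["value", "value_index"])
)

indexed = df.join(mapping, on="value", how="left")
indexed.show()
```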
9 votes, 2 answers

Online learning of LDA model in Spark

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?
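A hedged sketch of the closest built-in option: spark.ml's LDA has an online variational Bayes optimizer, which processes the corpus in mini-batches during a single fit(); to my knowledge it does not expose warm-starting from a previously saved model, so this covers only the "online training" half of the question. `corpus_df` is an assumed DataFrame of term-count vectors:

```
from pyspark.ml.clustering import LDA

lda = LDA(
    k=10,
    maxIter=20,
    optimizer="online",
    subsamplingRate=0.05,      # fraction of the corpus per mini-batch
    featuresCol="features",    # term-count vectors, e.g. from CountVectorizer
)
lda_model = lda.fit(corpus_df)
lda_model.describeTopics(5).show(truncate=False)
```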
9 votes, 4 answers

Relating column names to model parameters in pySpark ML

I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a…
Jeff
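A hedged sketch of the usual approach: after StringIndexer + OneHotEncoder + VectorAssembler, the assembled column carries per-slot metadata that can be zipped with the fitted coefficients. The names `assembled_df`, `glr_model`, and the column "features" are assumptions mirroring the question's setup:

```
# Per-slot names live in the "ml_attr" metadata of the assembled column.
attrs = assembled_df.schema["features"].metadata["ml_attr"]["attrs"]
slot_names = sorted((f["idx"], f["name"]) for group in attrs.values() for f in group)

# Coefficients are indexed like the assembled vector (intercept excluded).
for (idx, name), coef in zip(slot_names, glr_model.coefficients.toArray()):
    print(f"{name}: {coef:.4f}")
```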
9 votes, 1 answer

How to combine n-grams into one vocabulary in Spark?

Wondering if there is a built-in Spark feature to combine 1-, 2-, n-gram features into a single vocabulary. Setting n=2 in NGram followed by invocation of CountVectorizer results in a dictionary containing only 2-grams. What I really want is to…
Evan Zamir
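A hedged sketch of one workaround (assuming Spark 2.4+, where concat() works on array columns): build unigrams and bigrams separately, concatenate the token arrays, and fit a single CountVectorizer over the combined column so both kinds of grams land in one vocabulary:

```
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import Tokenizer, NGram, CountVectorizer

spark = SparkSession.builder.getOrCreate()
docs = spark.createDataFrame([("spark ml is nice",), ("spark is fast",)], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="unigrams").transform(docs)
bigrams = NGram(n=2, inputCol="unigrams", outputCol="bigrams").transform(tokens)

# Merge the two token arrays so one CountVectorizer sees 1- and 2-grams.
combined = bigrams.withColumn("all_grams", F.concat("unigrams", "bigrams"))

cv_model = CountVectorizer(inputCol="all_grams", outputCol="features").fit(combined)
print(cv_model.vocabulary)   # single vocabulary containing 1- and 2-grams
```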
9 votes, 1 answer

Non-linear (DAG) ML pipelines in Apache Spark

I've set up a simple Spark ML app, where I have a pipeline of independent transformers that add columns to a DataFrame of raw data. Since the transformers don't look at the output of one another, I was hoping I could run them in parallel in a…
hillel
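A hedged observation with a small sketch: Pipeline applies its stages sequentially, but stages that are pure Transformers (no fit step) only add lazy column expressions, so chaining independent ones does not by itself force multiple passes over the data. The SQLTransformer stages and the input DataFrame `df` below are illustrative:

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

# Two independent transformers; neither reads the other's output column.
add_len   = SQLTransformer(statement="SELECT *, length(text) AS text_len FROM __THIS__")
add_upper = SQLTransformer(statement="SELECT *, upper(text) AS text_upper FROM __THIS__")

pipeline = Pipeline(stages=[add_len, add_upper])
result = pipeline.fit(df).transform(df)   # df: assumed DataFrame with a `text` column
result.explain()                          # one optimized plan covering both stages
```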