Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
-1
votes
1 answer

Why does StringIndexer have no outputCols?

I am using Apache Zeppelin. My Anaconda version is conda 4.8.4 and my Spark version is: %spark2.pyspark spark.version u'2.3.1.3.0.1.0-187' When I run my code, it throws the following error: Exception AttributeError: "'StringIndexer' object has no…
JAdel
  • 1,309
  • 1
  • 7
  • 24
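
The multi-column setInputCols/setOutputCols setters were only added to StringIndexer in Spark 3.0; the 2.3.1 build quoted above has only inputCol/outputCol. A minimal Scala sketch of the usual workaround (the question uses PySpark, but the API is analogous), with hypothetical columns cat1 and cat2:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object MultiColumnStringIndexer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-indexer").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data with two categorical columns.
    val df = Seq(("a", "x"), ("b", "y"), ("a", "y")).toDF("cat1", "cat2")

    // StringIndexer in Spark 2.3.x exposes only inputCol/outputCol; the
    // multi-column setters arrived in Spark 3.0. On older versions, build
    // one indexer per column and chain them in a Pipeline.
    val stages: Array[PipelineStage] = Array("cat1", "cat2").map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
    }
    val indexed = new Pipeline().setStages(stages).fit(df).transform(df)
    indexed.show()

    spark.stop()
  }
}
```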
-1
votes
1 answer

Getting the error java.lang.NullPointerException when running application through spark-submit

Exception 2020-10-31 18:00:40,904 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" java.lang.NullPointerException at…
-1
votes
1 answer

pyspark random forest regressor predict multiclass

I have a random forest regressor PySpark ML model. The response variable has 9 classes. When I predict on the test data I get probabilities; I need to get the classes instead. Code used: rf =…
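
For a 9-class response, a classifier rather than a regressor yields discrete class predictions alongside the probabilities. A minimal Scala sketch (the question is PySpark, but the API mirrors it), using hypothetical label and feature columns:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object RfClassesNotProbabilities {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rf-classes").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical training data: an integer-valued class label (0.0 .. 8.0)
    // plus two numeric feature columns.
    val df = Seq((0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (8.0, 0.5, 1.5))
      .toDF("label", "f1", "f2")
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(df)

    // A classifier outputs a discrete class index in "prediction" and the
    // per-class probability vector in "probability".
    val model = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(assembled)

    model.transform(assembled).select("label", "prediction", "probability").show(false)
    spark.stop()
  }
}
```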
-1
votes
1 answer

How to convert RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am trying to convert RDD[Row] to RDD[Vector], but it throws an exception stating java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector. My code is val spark =…
Asif
  • 763
  • 8
  • 18
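
The two vector types live in different packages and cannot be cast into each other; mllib provides an explicit converter. A minimal Scala sketch, assuming the features column was produced by the newer ml API:

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors => MLlibVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object RowToMllibVector {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-to-vector").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrame whose "features" column holds ml vectors.
    val df = Seq(
      Tuple1(MLVectors.dense(1.0, 2.0)),
      Tuple1(MLVectors.dense(3.0, 4.0))
    ).toDF("features")

    // A cast fails because ml and mllib vectors are unrelated classes.
    // Extract the ml vector and convert it with mllib's Vectors.fromML.
    val vectors: RDD[MLlibVector] = df.rdd.map { row: Row =>
      MLlibVectors.fromML(row.getAs[MLVector]("features"))
    }
    vectors.collect().foreach(println)
    spark.stop()
  }
}
```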
-1
votes
1 answer

One-hot encoding multiple variables with Spark 2.1.1

I'm required to use Spark 2.1.1 and have a simple ML use case where I fit a logistic regression to perform a classification based on both continuous and categorical variables. I automatically detect categorical variables and index them in the ML…
mobupu
  • 245
  • 3
  • 10
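
A minimal Scala sketch of the usual pattern for Spark 2.1.x (which predates OneHotEncoderEstimator): one StringIndexer/OneHotEncoder pair per categorical column, chained in a Pipeline. The column names and toy data are hypothetical:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object OneHotMultipleColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("onehot").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: one continuous column and two categorical columns.
    val df = Seq((1.0, "red", "de"), (2.5, "blue", "fr"), (0.5, "red", "fr"))
      .toDF("amount", "color", "country")
    val categorical = Array("color", "country")

    // Spark 2.1.1 has no multi-column encoder, so each categorical column
    // gets its own StringIndexer + OneHotEncoder pair.
    val stages: Array[PipelineStage] = categorical.flatMap { c =>
      Seq[PipelineStage](
        new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
        new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec"))
    } :+ new VectorAssembler()
      .setInputCols(Array("amount") ++ categorical.map(c => s"${c}_vec"))
      .setOutputCol("features")

    val out = new Pipeline().setStages(stages).fit(df).transform(df)
    out.select("features").show(false)
    spark.stop()
  }
}
```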
-1
votes
1 answer

NGram on dataset with one word

I'm dabbling with SparkML, trying to build out a fuzzy match using Spark's out-of-the-box capabilities. Along the way, I'm building NGrams with n=2. However, some lines in my dataset contain single words, and the Spark pipeline fails on them. Regardless of Spark,…
Sahas
  • 3,046
  • 6
  • 32
  • 53
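
NGram with n=2 emits an empty array for rows with fewer than two tokens; it is typically the downstream stages that then fail on the empty output. A minimal Scala sketch that makes the behaviour visible and filters out the offending rows (the filtering step is just one possible workaround):

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.size

object NGramSingleWord {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ngram").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical tokenized input; the second row has only one token.
    val df = Seq(Seq("acme", "corp"), Seq("acme")).toDF("tokens")

    // With n = 2, single-token rows produce an empty bigram array, which
    // downstream stages (e.g. MinHashLSH) cannot handle.
    val bigrams = new NGram().setN(2).setInputCol("tokens").setOutputCol("bigrams").transform(df)
    bigrams.show(false)

    // One option: drop rows that produced no bigrams before continuing.
    val usable = bigrams.filter(size($"bigrams") > 0)
    usable.show(false)
    spark.stop()
  }
}
```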
-1
votes
1 answer

The i/p col features must be either string or numeric type, but got org.apache.spark.ml.linalg.VectorUDT

I am very new to Spark machine learning, just a 3-day-old novice, and I'm basically trying to predict some data using the Logistic Regression algorithm in Spark via Java. I have referred to a few sites and the documentation and came up with the code, and I am…
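
That error message typically appears when a StringIndexer (which accepts only string or numeric input) is pointed at an already-assembled vector column. A minimal Scala sketch of one way to arrange the stages (the question uses Java; the column names and data are hypothetical):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object VectorUdtError {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("vector-udt").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: a string label plus two numeric features.
    val df = Seq(("yes", 1.0, 2.0), ("no", 0.5, 1.0), ("yes", 2.0, 0.1))
      .toDF("outcome", "f1", "f2")

    // Index the raw string label, not the assembled vector; the vector
    // column goes straight to the estimator via setFeaturesCol.
    val labelIndexed = new StringIndexer()
      .setInputCol("outcome").setOutputCol("label").fit(df).transform(df)
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2")).setOutputCol("features").transform(labelIndexed)

    val model = new LogisticRegression()
      .setLabelCol("label").setFeaturesCol("features").fit(assembled)
    model.transform(assembled).select("outcome", "prediction").show()
    spark.stop()
  }
}
```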
-1
votes
1 answer

Spark Transformers [Scala]: Knowing schema transformation result before feeding the full data

Is there a method I could use if I want to know how a Transformer changes the schema, without providing the data? For example, I have a large DataFrame but I don't want to run it through the transformer; I just want to know the resulting schema…
o-0
  • 1,713
  • 14
  • 29
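
PipelineStage exposes transformSchema, which derives the output schema from an input schema alone, so no DataFrame is needed. A minimal Scala sketch using a VectorAssembler as the example transformer:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object SchemaWithoutData {
  def main(args: Array[String]): Unit = {
    // Only the schema is needed; no data (and no SparkSession) is required
    // to see how the transformer would change it.
    val inputSchema = StructType(Seq(
      StructField("lat", DoubleType),
      StructField("lon", DoubleType)))

    // transformSchema (a DeveloperApi method on every PipelineStage)
    // validates the input columns and returns the derived output schema.
    val assembler = new VectorAssembler()
      .setInputCols(Array("lat", "lon"))
      .setOutputCol("features")

    val outputSchema = assembler.transformSchema(inputSchema)
    println(outputSchema.treeString)
  }
}
```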
-1
votes
1 answer

Increasing the hash tables in MinHashLSH decreases accuracy and F1

I have used MinHashLSH with approximateSimilarityJoin in Scala and Spark 2.4 to find edges in a network (link prediction based on document similarity). My problem is that as I increase the hash tables in MinHashLSH, my accuracy and…
atheodos
  • 131
  • 12
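
For context, a minimal Scala sketch of the setup being described: MinHashLSH fitted on sparse document vectors, followed by approxSimilarityJoin. The toy vectors, the numHashTables value, and the 0.6 Jaccard-distance threshold are assumptions for illustration:

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MinHashJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("minhash").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sparse "document" vectors (e.g. hashed token sets).
    val docs = Seq(
      (0L, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
      (1L, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
      (2L, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
    ).toDF("id", "features")

    // More hash tables means more candidate pairs are generated for the
    // join (higher cost, fewer missed pairs); 5 is an arbitrary choice here.
    val model = new MinHashLSH()
      .setNumHashTables(5)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(docs)

    model.approxSimilarityJoin(docs, docs, 0.6, "JaccardDistance")
      .filter("datasetA.id < datasetB.id")   // drop self-pairs and duplicates
      .select("datasetA.id", "datasetB.id", "JaccardDistance")
      .show()
    spark.stop()
  }
}
```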
-1
votes
1 answer

Update machine learning models in DataFrame-based MLlib with PySpark (2.2.0)

I have built a machine learning model based on clustering, and now I just want to update it with new data periodically (on a daily basis). I am using PySpark MLlib, and I am not able to find any method in Spark for this need. Note, the required method 'partial_fit'…
-1
votes
1 answer

Logistic Regression for Numerical data

I have labels and features like this: label [2.3] with features 1 5.1 7.2 5 5 5; label [5.4] with features 4.5 3 2 4 6 4; label [6.3] with features 3.3 1.3 5.4 6. Like this, I have more than 10K entries. How can I use logistic regression to train a model in Spark? I know we can use…
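
A minimal Scala sketch of the standard recipe: pack the numeric columns into a single vector with VectorAssembler and fit LogisticRegression, keeping in mind that the label column must hold class indices (0.0, 1.0, ...). The toy data below is hypothetical:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object NumericLogisticRegression {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("numeric-lr").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical rows: a class label (0.0 or 1.0 here) and three
    // numeric feature columns.
    val df = Seq(
      (0.0, 1.0, 5.1, 7.2),
      (1.0, 4.5, 3.0, 2.0),
      (1.0, 3.3, 1.3, 5.4)
    ).toDF("label", "f1", "f2", "f3")

    // Spark ML estimators expect the features packed into one vector column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")
      .transform(df)

    val model = new LogisticRegression().fit(assembled)
    model.transform(assembled).select("label", "prediction").show()
    spark.stop()
  }
}
```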
-1
votes
1 answer

How does a Spark model treat a vector column?

How will a method in Spark treat a VectorAssembler column? For example, if I have longitude and latitude columns, is it better to assemble them using VectorAssembler and then put that into my model, or does it make no difference if I just put them…
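
VectorAssembler only concatenates the chosen columns into the single vector column that Spark ML estimators expect; the values themselves are unchanged. A minimal Scala sketch with hypothetical longitude/latitude data:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object AssembleLonLat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("assemble").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical longitude/latitude data.
    val df = Seq((13.40, 52.52), (2.35, 48.86)).toDF("lon", "lat")

    // The assembled vector is just [lon, lat]; the model sees the same
    // numbers either way, but it requires them in this vector format.
    val assembled = new VectorAssembler()
      .setInputCols(Array("lon", "lat"))
      .setOutputCol("features")
      .transform(df)
    assembled.show(false)
    spark.stop()
  }
}
```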
-1
votes
1 answer

Spark ML API to convert a vector to a probability for multilabel classification

I'm a bit new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression, random forest, etc.). Once I train on Dataset[LabeledPoint], I'm finding it hard to get an API where I get the…
user2103008
  • 414
  • 7
  • 19
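
Once a probabilistic classifier is fitted, transform() adds a probability column containing a Vector of per-class probabilities, which can be unpacked with a small UDF. A minimal Scala sketch for one of the binary classifiers, with hypothetical training data:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object ProbabilityColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("probability").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical binary training data for one of the per-label classifiers.
    val df = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (1.0, Vectors.dense(0.1, 1.2, -0.5))
    ).toDF("label", "features")

    val model = new LogisticRegression().fit(df)

    // "probability" holds a Vector of class probabilities; element 1 is
    // P(label = 1) for a binary model.
    val positiveProb = udf((v: Vector) => v(1))
    model.transform(df)
      .select($"label", positiveProb($"probability").alias("p_label1"))
      .show()
    spark.stop()
  }
}
```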
-1
votes
1 answer

how to name kmeans clusters in pyspark

I have the following code: %pyspark from pyspark.ml.linalg import Vectors from pyspark.ml.feature import VectorAssembler from pyspark.ml.clustering import KMeans from pyspark.ml import Pipeline (trainingData, testData) = dataFrame.randomSplit([0.7,…
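
KMeansModel.transform only produces integer cluster indices in the prediction column; names have to be attached afterwards, for example via a small lookup table. A minimal Scala sketch (the question is PySpark, but the approach carries over), with arbitrary cluster names:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object NamedClusters {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("named-kmeans").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical two-feature data forming two obvious groups.
    val df = Seq((1.0, 1.1), (1.2, 0.9), (9.0, 8.8), (8.7, 9.2)).toDF("x", "y")
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y")).setOutputCol("features").transform(df)

    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    val clustered = model.transform(features) // adds an integer "prediction" column

    // Cluster indices carry no names of their own; attach human-readable
    // labels (chosen arbitrarily here) by joining a small lookup table.
    val names = Seq((0, "cluster-A"), (1, "cluster-B")).toDF("prediction", "clusterName")
    clustered.join(names, "prediction").show(false)
    spark.stop()
  }
}
```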
-1
votes
1 answer

java.lang.UnsupportedOperationException: Schema for type breeze.linalg.Vector[Int] is not supported

I have a DataFrame with a column of type Array[Array[Int]]. I am trying to add up the array values using the Breeze API; however, I am getting a "schema for type not supported" error. input…
Masterbuilder
  • 499
  • 2
  • 12
  • 24
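
Spark has no encoder for breeze.linalg.Vector, so a UDF may use Breeze internally but must return a type Spark understands, such as Seq[Int]. A minimal Scala sketch, assuming the goal is an element-wise sum of the inner arrays and a Spark version with the generic sequence encoders (2.2+):

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object BreezeSumUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("breeze-sum").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical column of type array<array<int>>.
    val df = Seq(
      Seq(Seq(1, 2, 3), Seq(4, 5, 6)),
      Seq(Seq(10, 20, 30))
    ).toDF("arrays")

    // Do the element-wise sum with Breeze inside the UDF, but return a
    // plain Seq[Int], which Spark maps back to array<int>; returning a
    // breeze.linalg.Vector triggers the "schema not supported" error.
    val sumArrays = udf { (rows: Seq[Seq[Int]]) =>
      rows.map(r => BDV(r.toArray)).reduce(_ + _).toArray.toSeq
    }
    df.select(sumArrays($"arrays").alias("summed")).show(false)
    spark.stop()
  }
}
```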