Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
-1
votes
1 answer

Why does StringIndexer have no outputCols?

I am using Apache Zeppelin. My Anaconda version is conda 4.8.4 and my Spark version is: %spark2.pyspark spark.version u'2.3.1.3.0.1.0-187' When I run my code, it throws the following error: Exception AttributeError: "'StringIndexer' object has no…
JAdel
  • 1,309
  • 1
  • 7
  • 24
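
The multi-column setInputCols/setOutputCols setters were only added to StringIndexer in Spark 3.0; the 2.3.1 build quoted above has only inputCol/outputCol. A minimal Scala sketch of the usual workaround (the question uses PySpark, but the API is analogous), with hypothetical columns cat1 and cat2:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object MultiColumnStringIndexer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-indexer").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data with two categorical columns.
    val df = Seq(("a", "x"), ("b", "y"), ("a", "y")).toDF("cat1", "cat2")

    // StringIndexer in Spark 2.3.x exposes only inputCol/outputCol; the
    // multi-column setters arrived in Spark 3.0. On older versions, build
    // one indexer per column and chain them in a Pipeline.
    val stages: Array[PipelineStage] = Array("cat1", "cat2").map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
    }
    val indexed = new Pipeline().setStages(stages).fit(df).transform(df)
    indexed.show()

    spark.stop()
  }
}
```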
-1
votes
1 answer

Getting the error java.lang.NullPointerException when running application through spark-submit

Exception 2020-10-31 18:00:40,904 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" java.lang.NullPointerException at…
-1
votes
1 answer

pyspark random forest regressor predict multiclass

I have a random forest regressor PySpark ML model. The response variable has 9 classes. When I predict on the test data I get probabilities; I need to get the classes instead. Code used: rf =…
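
For a 9-class response, a classifier rather than a regressor yields discrete class predictions alongside the probabilities. A minimal Scala sketch (the question is PySpark, but the API mirrors it), using hypothetical label and feature columns:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object RfClassesNotProbabilities {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rf-classes").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical training data: an integer-valued class label (0.0 .. 8.0)
    // plus two numeric feature columns.
    val df = Seq((0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (8.0, 0.5, 1.5))
      .toDF("label", "f1", "f2")
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(df)

    // A classifier outputs a discrete class index in "prediction" and the
    // per-class probability vector in "probability".
    val model = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(assembled)

    model.transform(assembled).select("label", "prediction", "probability").show(false)
    spark.stop()
  }
}
```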
-1
votes
1 answer

How to convert RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am trying to convert RDD[Row] to RDD[Vector], but it throws an exception stating java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector. My code is val spark =…
Asif
  • 763
  • 8
  • 18
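
The two vector types live in different packages and cannot be cast into each other; mllib provides an explicit converter. A minimal Scala sketch, assuming the features column was produced by the newer ml API:

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors => MLlibVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object RowToMllibVector {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-to-vector").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrame whose "features" column holds ml vectors.
    val df = Seq(
      Tuple1(MLVectors.dense(1.0, 2.0)),
      Tuple1(MLVectors.dense(3.0, 4.0))
    ).toDF("features")

    // A cast fails because ml and mllib vectors are unrelated classes.
    // Extract the ml vector and convert it with mllib's Vectors.fromML.
    val vectors: RDD[MLlibVector] = df.rdd.map { row: Row =>
      MLlibVectors.fromML(row.getAs[MLVector]("features"))
    }
    vectors.collect().foreach(println)
    spark.stop()
  }
}
```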
-1
votes
1 answer

One-hot encoding multiple variables with Spark 2.1.1

I'm required to use Spark 2.1.1 and have a simple ML use case where I fit a logistic regression to perform a classification based on both continuous and categorical variables. I automatically detect categorical variables and index them in the ML…
mobupu
  • 245
  • 3
  • 10
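
A minimal Scala sketch of the usual pattern for Spark 2.1.x (which predates OneHotEncoderEstimator): one StringIndexer/OneHotEncoder pair per categorical column, chained in a Pipeline. The column names and toy data are hypothetical:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object OneHotMultipleColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("onehot").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: one continuous column and two categorical columns.
    val df = Seq((1.0, "red", "de"), (2.5, "blue", "fr"), (0.5, "red", "fr"))
      .toDF("amount", "color", "country")
    val categorical = Array("color", "country")

    // Spark 2.1.1 has no multi-column encoder, so each categorical column
    // gets its own StringIndexer + OneHotEncoder pair.
    val stages: Array[PipelineStage] = categorical.flatMap { c =>
      Seq[PipelineStage](
        new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
        new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec"))
    } :+ new VectorAssembler()
      .setInputCols(Array("amount") ++ categorical.map(c => s"${c}_vec"))
      .setOutputCol("features")

    val out = new Pipeline().setStages(stages).fit(df).transform(df)
    out.select("features").show(false)
    spark.stop()
  }
}
```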
-1
votes
1 answer

NGram on dataset with one word

I'm dabbling with SparkML, trying to build out a fuzzy match using Spark's out-of-the-box capabilities. Along the way, I'm building NGrams with n=2. However, some lines in my dataset contain single words, and the Spark pipeline fails on them. Regardless of Spark,…
Sahas
  • 3,046
  • 6
  • 32
  • 53
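
NGram with n=2 emits an empty array for rows with fewer than two tokens; it is typically the downstream stages that then fail on the empty output. A minimal Scala sketch that makes the behaviour visible and filters out the offending rows (the filtering step is just one possible workaround):

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.size

object NGramSingleWord {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ngram").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical tokenized input; the second row has only one token.
    val df = Seq(Seq("acme", "corp"), Seq("acme")).toDF("tokens")

    // With n = 2, single-token rows produce an empty bigram array, which
    // downstream stages (e.g. MinHashLSH) cannot handle.
    val bigrams = new NGram().setN(2).setInputCol("tokens").setOutputCol("bigrams").transform(df)
    bigrams.show(false)

    // One option: drop rows that produced no bigrams before continuing.
    val usable = bigrams.filter(size($"bigrams") > 0)
    usable.show(false)
    spark.stop()
  }
}
```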
-1
votes
1 answer

The i/p col features must be either string or numeric type, but got org.apache.spark.ml.linalg.VectorUDT

I am very new to Spark machine learning, just a 3-day-old novice, and I'm basically trying to predict some data using the Logistic Regression algorithm in Spark via Java. I have referred to a few sites and the documentation and came up with the code, and I am…
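
That error message typically appears when a StringIndexer (which accepts only string or numeric input) is pointed at an already-assembled vector column. A minimal Scala sketch of one way to arrange the stages (the question uses Java; the column names and data are hypothetical):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object VectorUdtError {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("vector-udt").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: a string label plus two numeric features.
    val df = Seq(("yes", 1.0, 2.0), ("no", 0.5, 1.0), ("yes", 2.0, 0.1))
      .toDF("outcome", "f1", "f2")

    // Index the raw string label, not the assembled vector; the vector
    // column goes straight to the estimator via setFeaturesCol.
    val labelIndexed = new StringIndexer()
      .setInputCol("outcome").setOutputCol("label").fit(df).transform(df)
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2")).setOutputCol("features").transform(labelIndexed)

    val model = new LogisticRegression()
      .setLabelCol("label").setFeaturesCol("features").fit(assembled)
    model.transform(assembled).select("outcome", "prediction").show()
    spark.stop()
  }
}
```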
-1
votes
1 answer

Spark Transformers [Scala]: Knowing schema transformation result before feeding the full data

Is there a method I could use if I want to know how a Transformer changes the schema, without providing the data? For example, I have a large DataFrame but I don't want to run it through the transformer; I just want to know the resulting schema…
o-0
  • 1,713
  • 14
  • 29
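
PipelineStage exposes transformSchema, which derives the output schema from an input schema alone, so no DataFrame is needed. A minimal Scala sketch using a VectorAssembler as the example transformer:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object SchemaWithoutData {
  def main(args: Array[String]): Unit = {
    // Only the schema is needed; no data (and no SparkSession) is required
    // to see how the transformer would change it.
    val inputSchema = StructType(Seq(
      StructField("lat", DoubleType),
      StructField("lon", DoubleType)))

    // transformSchema (a DeveloperApi method on every PipelineStage)
    // validates the input columns and returns the derived output schema.
    val assembler = new VectorAssembler()
      .setInputCols(Array("lat", "lon"))
      .setOutputCol("features")

    val outputSchema = assembler.transformSchema(inputSchema)
    println(outputSchema.treeString)
  }
}
```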
-1
votes
1 answer

Increasing the hash tables in MinHashLSH decreases accuracy and F1

I have used MinHashLSH with approximateSimilarityJoin in Scala and Spark 2.4 to find edges in a network (link prediction based on document similarity). My problem is that as I increase the hash tables in MinHashLSH, my accuracy and…
atheodos
  • 131
  • 12
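
For context, a minimal Scala sketch of the setup being described: MinHashLSH fitted on sparse document vectors, followed by approxSimilarityJoin. The toy vectors, the numHashTables value, and the 0.6 Jaccard-distance threshold are assumptions for illustration:

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MinHashJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("minhash").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sparse "document" vectors (e.g. hashed token sets).
    val docs = Seq(
      (0L, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
      (1L, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
      (2L, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
    ).toDF("id", "features")

    // More hash tables means more candidate pairs are generated for the
    // join (higher cost, fewer missed pairs); 5 is an arbitrary choice here.
    val model = new MinHashLSH()
      .setNumHashTables(5)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(docs)

    model.approxSimilarityJoin(docs, docs, 0.6, "JaccardDistance")
      .filter("datasetA.id < datasetB.id")   // drop self-pairs and duplicates
      .select("datasetA.id", "datasetB.id", "JaccardDistance")
      .show()
    spark.stop()
  }
}
```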
-1
votes
1 answer

Update machine learning models in DataFrame-based MLlib with PySpark (2.2.0)

I have built a machine learning model based on clustering, and now I just want to update it with new data periodically (on a daily basis). I am using PySpark MLlib, and I am not able to find any method in Spark for this need. Note, the required method 'partial_fit'…
-1
votes
1 answer

Logistic Regression for Numerical data

I have labels and features like this: label [2.3] with features 1 5.1 7.2 5 5 5; label [5.4] with features 4.5 3 2 4 6 4; label [6.3] with features 3.3 1.3 5.4 6. Like this, I have more than 10K entries. How can I use logistic regression to train a model in Spark? I know we can use…
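
A minimal Scala sketch of the standard recipe: pack the numeric columns into a single vector with VectorAssembler and fit LogisticRegression, keeping in mind that the label column must hold class indices (0.0, 1.0, ...). The toy data below is hypothetical:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object NumericLogisticRegression {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("numeric-lr").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical rows: a class label (0.0 or 1.0 here) and three
    // numeric feature columns.
    val df = Seq(
      (0.0, 1.0, 5.1, 7.2),
      (1.0, 4.5, 3.0, 2.0),
      (1.0, 3.3, 1.3, 5.4)
    ).toDF("label", "f1", "f2", "f3")

    // Spark ML estimators expect the features packed into one vector column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")
      .transform(df)

    val model = new LogisticRegression().fit(assembled)
    model.transform(assembled).select("label", "prediction").show()
    spark.stop()
  }
}
```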
-1
votes
1 answer

How does a Spark model treat a vector column?

How will a method in Spark treat a VectorAssembler column? For example, if I have longitude and latitude columns, is it better to assemble them using VectorAssembler and then put that into my model, or does it make no difference if I just put them…
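
VectorAssembler only concatenates the chosen columns into the single vector column that Spark ML estimators expect; the values themselves are unchanged. A minimal Scala sketch with hypothetical longitude/latitude data:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object AssembleLonLat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("assemble").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical longitude/latitude data.
    val df = Seq((13.40, 52.52), (2.35, 48.86)).toDF("lon", "lat")

    // The assembled vector is just [lon, lat]; the model sees the same
    // numbers either way, but it requires them in this vector format.
    val assembled = new VectorAssembler()
      .setInputCols(Array("lon", "lat"))
      .setOutputCol("features")
      .transform(df)
    assembled.show(false)
    spark.stop()
  }
}
```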
-1
votes
1 answer

Spark ML API to convert a vector to a probability for multilabel classification

I'm a bit new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression, random forest, etc.). Once I train on Dataset[LabeledPoint], I'm finding it hard to get an API where I get the…
user2103008
  • 414
  • 7
  • 19
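
Once a probabilistic classifier is fitted, transform() adds a probability column containing a Vector of per-class probabilities, which can be unpacked with a small UDF. A minimal Scala sketch for one of the binary classifiers, with hypothetical training data:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object ProbabilityColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("probability").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical binary training data for one of the per-label classifiers.
    val df = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (1.0, Vectors.dense(0.1, 1.2, -0.5))
    ).toDF("label", "features")

    val model = new LogisticRegression().fit(df)

    // "probability" holds a Vector of class probabilities; element 1 is
    // P(label = 1) for a binary model.
    val positiveProb = udf((v: Vector) => v(1))
    model.transform(df)
      .select($"label", positiveProb($"probability").alias("p_label1"))
      .show()
    spark.stop()
  }
}
```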
-1
votes
1 answer

how to name kmeans clusters in pyspark

I have the following code: %pyspark from pyspark.ml.linalg import Vectors from pyspark.ml.feature import VectorAssembler from pyspark.ml.clustering import KMeans from pyspark.ml import Pipeline (trainingData, testData) = dataFrame.randomSplit([0.7,…
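
KMeansModel.transform only produces integer cluster indices in the prediction column; names have to be attached afterwards, for example via a small lookup table. A minimal Scala sketch (the question is PySpark, but the approach carries over), with arbitrary cluster names:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object NamedClusters {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("named-kmeans").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical two-feature data forming two obvious groups.
    val df = Seq((1.0, 1.1), (1.2, 0.9), (9.0, 8.8), (8.7, 9.2)).toDF("x", "y")
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y")).setOutputCol("features").transform(df)

    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    val clustered = model.transform(features) // adds an integer "prediction" column

    // Cluster indices carry no names of their own; attach human-readable
    // labels (chosen arbitrarily here) by joining a small lookup table.
    val names = Seq((0, "cluster-A"), (1, "cluster-B")).toDF("prediction", "clusterName")
    clustered.join(names, "prediction").show(false)
    spark.stop()
  }
}
```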
-1
votes
1 answer

java.lang.UnsupportedOperationException: Schema for type breeze.linalg.Vector[Int] is not supported

I have a DataFrame with a column of type Array[Array[Int]]. I am trying to add up the array values using the Breeze API; however, I am getting a "schema for type not supported" error. input…
Masterbuilder
  • 499
  • 2
  • 12
  • 24
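
Spark has no encoder for breeze.linalg.Vector, so a UDF may use Breeze internally but must return a type Spark understands, such as Seq[Int]. A minimal Scala sketch, assuming the goal is an element-wise sum of the inner arrays and a Spark version with the generic sequence encoders (2.2+):

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object BreezeSumUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("breeze-sum").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical column of type array<array<int>>.
    val df = Seq(
      Seq(Seq(1, 2, 3), Seq(4, 5, 6)),
      Seq(Seq(10, 20, 30))
    ).toDF("arrays")

    // Do the element-wise sum with Breeze inside the UDF, but return a
    // plain Seq[Int], which Spark maps back to array<int>; returning a
    // breeze.linalg.Vector triggers the "schema not supported" error.
    val sumArrays = udf { (rows: Seq[Seq[Int]]) =>
      rows.map(r => BDV(r.toArray)).reduce(_ + _).toArray.toSeq
    }
    df.select(sumArrays($"arrays").alias("summed")).show(false)
    spark.stop()
  }
}
```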