Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
-1
votes
1 answer

Pyspark NLP - CountVectorizer Max DF or TF. How to filter common occurrences from dataset

I am using CountVectorizer to prepare a dataset for ML. I want to filter out rare words, and I use the CountVectorizer parameters minDF or minTF for that. I would also like to remove items that appear 'often' in my dataset. I do not see a maxTF…
JB5
  • 97
  • 2
  • 8
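
The filtering described in this question can be illustrated without Spark. The following is a minimal pure-Python sketch of keeping only terms whose document frequency lies between a lower and an upper bound. The function name `filter_vocab_by_df` and its parameters are hypothetical and are not part of CountVectorizer; whether a given Spark release exposes an upper bound such as `maxDF` should be checked against its documentation.

```python
from collections import Counter


def filter_vocab_by_df(docs, min_df=1, max_df=None):
    """Return the sorted terms whose document frequency is in [min_df, max_df].

    docs is a list of token lists. max_df=None means no upper bound.
    This is an illustrative helper, not a Spark API.
    """
    df = Counter()
    for tokens in docs:
        # A set counts each term at most once per document.
        df.update(set(tokens))
    upper = len(docs) if max_df is None else max_df
    return sorted(term for term, n in df.items() if min_df <= n <= upper)


docs = [
    ["spark", "ml", "the"],
    ["spark", "nlp", "the"],
    ["the", "rare"],
]

# Drop terms seen in fewer than 2 documents and terms seen in more than 2.
vocab = filter_vocab_by_df(docs, min_df=2, max_df=2)
print(vocab)
```

Here the bounds are absolute document counts; a fractional bound would need to be multiplied by `len(docs)` first.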
-1
votes
1 answer

convert Seq[(String, Any)] to Seq[(String, org.apache.spark.ml.PredictionModel[_, _])] in spark

I have trained different models on my dataset, such as nbModel, dtModel, rfModel, and GbmModel. All of these are machine learning models. Now, when I save them in a variable as val models = Seq(("NB", nbModel), ("DT", dtModel), ("RF", rfModel),…
-1
votes
1 answer

type mismatch error while running ml.PredictionModel in spark

After training all the models, I am trying to rename each model's prediction column to uniquely identify the model prediction inside the dataset. I am getting a type mismatch error as specified below: import org.apache.spark.ml.PredictionModel import…
Parv bali
  • 147
  • 1
  • 11
-1
votes
1 answer

Spark ML - prediction in KMeans

I have created a KMeans model using Spark ML methods: val kmeans = new KMeans() val model = kmeans.fit(df). My model is ready, but how do I predict which cluster new data points will fall into? In MLlib, model.predict(Vector) predicts the cluster…
Ishan Kumar
  • 1,941
  • 3
  • 20
  • 29
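
Conceptually, assigning a new point to a k-means cluster means picking the nearest cluster center. The sketch below shows that rule in plain Python, assuming Euclidean distance and an already known list of centers. The helper `predict_cluster` is hypothetical and is not a Spark API; in Spark ML a fitted model is normally applied to a DataFrame rather than to a single vector.

```python
import math


def predict_cluster(point, centers):
    """Return the index of the center closest to point (Euclidean distance).

    Illustrative helper only; centers is a list of coordinate tuples.
    """
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


# Hypothetical cluster centers, as a fitted model might produce them.
centers = [(0.0, 0.0), (10.0, 10.0)]

label = predict_cluster((9.0, 8.5), centers)
print(label)
```

Ties are resolved in favor of the lower index, which is an arbitrary choice made for this sketch.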
-1
votes
3 answers

PCA() got an unexpected keyword argument 'k'

I am trying to perform PCA from a Spark application using the PySpark API in a Python script. I am doing it this way: pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures") PCAmodel = pca.fit(data). When I run those two lines of code in the pyspark shell it…
user5492457
-1
votes
1 answer

Regression in PySpark. Which library to use

What are the differences between "pyspark.mllib.regression" and "pyspark.ml.regression"? Which one should be used?
Shiv
  • 369
  • 2
  • 13
-1
votes
1 answer

How can we compare the decision trees algorithm performance in terms of accuracy from scikit-learn and from Spark ML?

I am comparing the accuracy for text classification obtained using sklearn DT and Spark ML DT with the same features and dataset. Is it appropriate to even compare them? The reason being, the parameter lists are different for the two, so I think…
-1
votes
2 answers

can't define a udf inside pyspark project

I have a Python project that uses pyspark, and I am trying to define a UDF inside the Spark project (not in my Python project), specifically in spark\python\pyspark\ml\tuning.py, but I get pickling problems: it can't load the UDF. The…
ofer-a
  • 521
  • 5
  • 21
-1
votes
1 answer

Exception on using VectorAssembler in apache spark ml

I'm trying to create a VectorAssembler to build the input for logistic regression and am using the following code: //imports import org.apache.spark.ml.feature.VectorAssembler import org.apache.spark.mllib.linalg.{Vector, Vectors, VectorUDT} 1…
hbabbar
  • 947
  • 4
  • 15
  • 33
-3
votes
1 answer

How to process dataframe for ML using Pyspark

I am doing GBT modelling using pyspark. I have a dataframe; the input features (X) are multiple columns A, B, C, and the output (Y) is one column with binary values 0 and 1. I am confused about VectorAssembler and transform in processing the…