Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
0
votes
1 answer

matrix factorization model returning much smaller dataframe after predicting ratings in pyspark

I'm trying to create a product recommender with the code below. I'm using matrix factorization from spark ml. I have data that has a customer_id, product_id, and a numeric rating value that has been normalized. So all rating values are between 0…
0
votes
1 answer

How do I extract feature_importances from my model in SparklyR?

I would like to extract feature_importances from my model in SparklyR. So far I have the following reproducible code that is working: library(sparklyr) library(dplyr) sc <- spark_connect(method = "databricks") dtrain <- data_frame(text =…
0
votes
1 answer

Generate sparse vector for all the column values in spark dataframe

column1 column2
1       1
1       0
1       0
0       0

Now I want to calculate the hash or sparse vector of all the values in column1 and column2.
0
votes
1 answer

How to groupBy and perform data scaling over each group using MLlib and PySpark?

I have a dataset just like in the example below and I am trying to group all rows from a given symbol and perform standard scaling of each group, so that at the end all my data is scaled by groups. How can I do that with MLlib and PySpark? I could…
0
votes
1 answer

Pyspark Pipeline Performance

Is there any performance difference between using 2 separate pipelines vs 1 combined pipeline? For example, 2 separate pipelines: from pyspark.ml import Pipeline from pyspark.ml.feature import VectorAssembler df = spark.createDataFrame([ (1.0,…
Tim • 3,178
0
votes
1 answer

How to convert a DataFrame to an Array of dense vectors?

How would I convert the following DataFrame val df = Seq( (5.0, 1.0, 1.0, 3.0, 7.0), (2.0, 0.0, 3.0, 4.0, 5.0), (4.0, 0.0, 0.0, 6.0, 7.0)).toDF("m1", "m2", "m3", "m4", "m5") //df: res166: org.apache.spark.sql.DataFrame = [m1: int, m2: int ...…
Amazonian • 391
0
votes
1 answer

Adding custom metadata to DataFrame schema using iceberg table format

I'm adding custom metadata into the DataFrame's schema in my PySpark application using StructField's metadata field. It worked fine when I wrote parquet files directly to s3. The custom metadata was available when reading these parquet files as…
0
votes
1 answer

Training/Test data with SparkML in Scala

I've been facing an issue for the past couple of hours. In theory, when we split data for training and testing, we should standardize the training data independently, so as not to introduce bias, and then after having trained the model do…
Aron Latis • 38
0
votes
1 answer

ML Tuning - Cross Validation in Spark

I am looking at the cross-validation code example found in https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation It says: CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test…
rayqz • 249
0
votes
1 answer

Mix Spark MLlib and Spark NLP in a pipeline

In an MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)? When I try to use both in a pipeline I get: myColName must be of type equal to one of the following types: [array&lt;string&gt;, array&lt;string&gt;] but…
Benjamin • 3,350
0
votes
1 answer

Vertex AI custom model training for a PySpark ML model

Is it possible to train a Spark/PySpark MLlib model using Vertex AI custom container model building? I couldn't find any reference in the Vertex AI documentation regarding Spark model training. For distributed processing model building the only options…
0
votes
0 answers

alternative to pivoting column to create vector for kmeans in pyspark

I am trying to cluster with kmeans in pyspark. I have data like the id_predictions_df example below. I'm first pivoting the data to create a dataframe where the columns are the id_y indices and the rows would be the id_x. The values are then the…
user3476463 • 3,967
0
votes
0 answers

How to decode one-hot encoded values in Spark ML

Is it possible to reverse a OneHotEncoder in Spark ML, i.e. recover the original values? Is there any way to achieve this? StringIndexer dateIndexer = new StringIndexer(); csvData =…
0
votes
1 answer

Java Spark ML - java.lang.IllegalArgumentException: label does not exist. Available:

Small question regarding a Spark exception I am getting please. I have a very straightforward dataset:

myCoolDataset.show();
+----------+-----+
|      time|value|
+----------+-----+
|1621900800|   43|
…
PatPanda • 3,644
0
votes
1 answer

How to specify "positive class" in sparkml classification?

How to specify the "positive class" in sparkml (binary) classification? (Or perhaps: How does a MulticlassClassificationEvaluator determine which class is the "positive" one?) Suppose we were training a model to target Precision in a binary…
lampShadesDrifter • 3,925