Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark

2241 questions
1
vote
0 answers

how to reduce time of spark application run by "java -jar"

I execute a Spark application in two ways. The application is Naive Bayes training using MLlib. Using "spark-submit", it executes successfully on a set of data. Using "java -jar", it takes more time than in case 1. In both cases I have the same data set and same…
Tinku
  • 751
  • 5
  • 19
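
A common reason for the gap: spark-submit injects the master URL, memory settings, and spark-defaults.conf, while a bare java -jar run gets only what the application sets itself, so Spark falls back to defaults such as 1g executor memory. A minimal Scala sketch that mirrors the submit-time configuration in code (values are illustrative assumptions, not tuned recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("NaiveBayesTraining")
      .setMaster("local[*]")              // assumption: single-machine run
      .set("spark.executor.memory", "4g") // mirror whatever spark-submit was given
    val sc = new SparkContext(conf)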
1
vote
1 answer

Twitter sentiment analysis using Naive Bayes in apache spark

I am trying to do a basic Twitter sentiment analysis using Apache Spark. The page below explains the Naive Bayes function used in Apache Spark, which would be a candidate for the above…
Siva
  • 1,839
  • 5
  • 21
  • 31
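
A minimal Scala sketch of the usual MLlib recipe for this, assuming a hypothetical tweets RDD of (label, tokens) pairs:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    // tweets is an assumed RDD[(Double, Seq[String])] of (label, tokens),
    // with 0.0 = negative and 1.0 = positive.
    val tf = new HashingTF(numFeatures = 10000)
    val training = tweets.map { case (label, words) =>
      LabeledPoint(label, tf.transform(words))
    }
    val model = NaiveBayes.train(training, lambda = 1.0)
    val sentiment = model.predict(tf.transform(Seq("loving", "this")))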
1
vote
1 answer

Classification with Spark MLlib in Java

I am trying to build a classification system with Apache Spark's MLlib. I have shortlisted the Naive Bayes algorithm for this, and will be using Java 8 for its support of lambda expressions. I am a newbie in terms of lambda expressions and hence am…
jatinpreet
  • 589
  • 1
  • 4
  • 11
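
The MLlib calls themselves are language-agnostic; a Scala sketch of the same train/evaluate flow (a Java 8 version would replace these closures with lambdas on a JavaRDD), with data as an assumed featurized input:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    val data: RDD[LabeledPoint] = ???            // your featurized input
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = NaiveBayes.train(train, lambda = 1.0)
    val accuracy = test
      .map(p => (model.predict(p.features), p.label))
      .filter { case (pred, label) => pred == label }
      .count().toDouble / test.count()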
1
vote
1 answer

Apache Spark - MLlib - K-Means Input format

I want to perform a K-Means task, but I fail while training the model and get kicked out of Spark's Scala shell before I get my result metrics. I am not sure if the input format is the problem or something else. I use Spark 1.0.0 and my input text file (400MB)…
user3400996
  • 73
  • 1
  • 9
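
For reference, KMeans.train expects an RDD[Vector]. A minimal Scala sketch that parses a whitespace-separated text file, assuming an active SparkContext sc and an illustrative path:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Each line is assumed to hold space-separated numbers; caching avoids
    // re-reading the file on every iteration.
    val data = sc.textFile("data.txt")
      .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
      .cache()
    val model = KMeans.train(data, k = 10, maxIterations = 20)
    println(model.computeCost(data))   // within-cluster sum of squares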
1
vote
1 answer

Passing long values into MLlib's Rating() method

I am trying to build a recommender system using Spark's MLlib library (using Scala). In order to be able to use the ALS train method, I need to build a rating matrix using the Rating() method (which is part of the package…
shahharsh2603
  • 73
  • 2
  • 9
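
Rating takes Int ids, so 64-bit ids need remapping first. A hedged Scala sketch, with raw as an assumed RDD[(Long, Long, Double)] of (userId, productId, rating); the collected maps are fine for a sketch, but large id spaces would need a join instead:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val userIndex = raw.map(_._1).distinct().zipWithIndex()   // Long -> dense Int
      .mapValues(_.toInt).collectAsMap()
    val prodIndex = raw.map(_._2).distinct().zipWithIndex()
      .mapValues(_.toInt).collectAsMap()
    val ratings = raw.map { case (u, p, r) => Rating(userIndex(u), prodIndex(p), r) }
    val model = ALS.train(ratings, rank = 10, iterations = 10)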
0
votes
0 answers

spark-job on spark kubernetes cluster took long time to complete

I have set up a 3-node Spark Kubernetes cluster with the spark-kubernetes-operator Helm chart. The Kubernetes cluster is deployed on AWS t2.2xlarge instances with 8 vCPUs and 32 GB memory. I have built a RandomForest price prediction Spark pipeline with Scala…
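
Two things worth checking on this setup: t2 instances are burstable, so exhausted CPU credits throttle long-running jobs, and executors sized far below node capacity leave it idle. A sketch of explicit executor sizing (illustrative assumptions, not tuned recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.instances", "3")
      .config("spark.executor.cores", "6")       // leave headroom for k8s daemons
      .config("spark.executor.memory", "20g")
      .getOrCreate()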
0
votes
0 answers

Is weightcol of spark decision tree classifier used directly in impurity calculation?

My dataset is > 300 million rows, so I think spark-ml would be a better choice than sklearn. Since there are many rows with the same set of features (they are different data points), I further aggregate the dataset by feature and target and produce a weight…
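
Since Spark 3.0 the spark.ml tree classifiers accept instance weights via setWeightCol, and the weights enter the per-node label statistics from which impurity is computed. A minimal sketch, assuming df already carries the aggregated columns:

    import org.apache.spark.ml.classification.DecisionTreeClassifier

    // df is assumed to carry "features" (vector), "label", and a "weight"
    // column holding the count of collapsed duplicate rows.
    val dt = new DecisionTreeClassifier()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setWeightCol("weight")
    val model = dt.fit(df)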
0
votes
0 answers

How to update required memory for single node Apache Spark Scala Job?

I have been running a Spark Scala job using 32 cores on a single-node Apache Spark 3.2.x cluster. The actual host machine has 256 GB RAM and 128 cores. How can I provide the memory required for the entire job? At present this job processes 10M…
user648330
  • 25
  • 3
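
In local mode the driver JVM hosts the executors, so its heap must be sized at launch (for example spark-submit --driver-memory 64g); setting spark.driver.memory later in code has no effect because the JVM is already running. A sketch with illustrative figures:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[32]")                           // the 32 cores in use
      .config("spark.sql.shuffle.partitions", "64")  // scale partitions to cores
      .getOrCreate()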
0
votes
0 answers

How to change a sparse vector column in a dataframe into a dense one with PySpark? (or how to translate my Scala function to a pySpark one?)

I'm trying to tune an LLM (BERT, or embeddings such as GloVe) on a text column for text classification. I'm using SparkNLP for preprocessing and creating the embeddings, and PySpark (Spark ML) for the machine learning part. I'm at a point where I have…
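
A sketch of the Scala-side pattern, assuming a vector column named "features"; the PySpark analogue wraps the same conversion in a Python UDF (noted in the comment):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // PySpark analogue: udf(lambda v: DenseVector(v.toArray()), VectorUDT())
    val toDense = udf((v: Vector) => v.toDense)
    val densified = df.withColumn("features", toDense(col("features")))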
0
votes
1 answer

Using PipelineModel.load() in custom MLFlow PyFunc class results in error

I am creating a custom PyFunc class to use with the Databricks Feature Store, as their Model Serving UI and the feature store's log_model() methods only work with the PythonModel class. The underlying model is a PipelineModel() which performs various binning…
0
votes
0 answers

How to subtract one DenseVector from another in Spark MLlib

a and b are two spark.mllib.linalg.DenseVectors: import org.apache.spark.mllib.linalg.DenseVector ... val a = new DenseVector(Array(1.0, 2.0, 3.0)) val b = new DenseVector(Array(2.0, 3.0, 1.0)) ... I want to get c as the subtraction of b from a. How to do it?
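
MLlib vectors expose no public arithmetic operators, so one portable approach is to operate on the backing arrays. A minimal sketch:

    import org.apache.spark.mllib.linalg.DenseVector

    val a = new DenseVector(Array(1.0, 2.0, 3.0))
    val b = new DenseVector(Array(2.0, 3.0, 1.0))
    // Element-wise a - b over the public values arrays.
    val c = new DenseVector(a.values.zip(b.values).map { case (x, y) => x - y })
    // c: [-1.0, -1.0, 2.0]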
0
votes
1 answer

Spark KMeans produces deterministic results and not random

I am running Spark KMeans and I would like to have a random seed in every run, for different results each time; however, this is not the case. This is the code that I am using: KMeans kmeans = new KMeans().setK(4).setInitMode("random"); KMeansModel…
Des0lat0r
  • 482
  • 3
  • 18
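
spark.ml estimators default to a fixed seed derived from the estimator class name, so even "random" initialization is reproducible run to run. A sketch that varies the seed explicitly (shown in Scala; the Java call is identical):

    import org.apache.spark.ml.clustering.KMeans
    import scala.util.Random

    val kmeans = new KMeans()
      .setK(4)
      .setInitMode("random")
      .setSeed(Random.nextLong())   // different centroids on each run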
0
votes
1 answer

How to load spark saved pipeline and retrain with new data

I hope to load a saved pipeline with Spark and then re-fit it with new data collected day by day. Here is my current code: new_data_df = data in current day if target path exists: model = PipelineModel.load("path/to/pipeline") …
G_cy
  • 994
  • 3
  • 13
  • 28
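
A fitted PipelineModel holds only transformers and cannot be re-fit; what can be re-fit is the unfitted estimator Pipeline, saved separately. A Scala sketch under that assumption, with illustrative paths and an active SparkSession assumed (note this trains afresh rather than warm-starting):

    import org.apache.spark.ml.{Pipeline, PipelineModel}

    val newData = spark.read.parquet("path/to/today")
    val pipeline = Pipeline.load("path/to/pipeline")   // unfitted estimator
    val model: PipelineModel = pipeline.fit(newData)
    model.write.overwrite().save("path/to/model")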
0
votes
1 answer

PySpark ArrayIndexOutOfBoundsException error during model fit: How can I diagnose and fix the issue?

I am working on a PySpark project where I'm trying to fit a MultilayerPerceptronClassifier model to my text data using the fit method. I am using the Word2Vec model provided by MLlib to extract features. However, I keep running into an…
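
The usual cause of this exception is a layers array inconsistent with the data: the first entry must equal the feature vector length and the last the number of label classes. A sketch, assuming a hypothetical Word2Vec vector size of 100 and binary labels:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    // 100 matches an assumed Word2Vec().setVectorSize(100); 2 = label classes.
    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(Array(100, 64, 2))
      .setMaxIter(50)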
0
votes
0 answers

Apache Spark MLlib StandardScaler vs z-score

So, I am wondering if there is any difference between the StandardScaler of Spark and a simple z-score calculation. The formula for the z-score calculation is: z = (x - mean) / std. However, for the StandardScaler of Spark it is not clear to me how…
Des0lat0r
  • 482
  • 3
  • 18
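
The difference is the default: MLlib's StandardScaler ships with withMean = false, so out of the box it computes x / std rather than (x - mean) / std, where std is the sample standard deviation (n - 1 denominator). A sketch that matches the textbook z-score, assuming data is an RDD[Vector]:

    import org.apache.spark.mllib.feature.StandardScaler

    // Enable both flags to get (x - mean) / std per feature.
    val scaler = new StandardScaler(withMean = true, withStd = true)
    val model = scaler.fit(data)
    val zScored = model.transform(data)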