Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark.


2241 questions
0 votes, 0 answers

MinHashLSH Issue on PySpark

I am trying to run a text similarity analysis using PySpark. After vectorizing my text inputs using CountVectorizer with vocabSize=5000, I am running an approxSimilarityJoin on the data. When I do this, I get an error related to non-zero values on…
Manuel Martinez • 798
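The non-zero-values error above is characteristic of MinHashLSH: it approximates Jaccard distance and rejects input vectors with no non-zero entries (a document whose tokens all fall outside the CountVectorizer vocabulary produces exactly such a vector). The MinHash estimate it relies on can be sketched in plain Python, with no Spark dependency — the hash family and documents below are illustrative, not the asker's data:

```python
import random
import zlib

def minhash_signature(items, hash_funcs):
    # Signature slot k = minimum of hash function k over the set's members.
    # Note min() fails on an empty set -- the analogue of MinHashLSH's
    # "at least 1 non-zero entry" requirement.
    return [min(h(x) for x in items) for h in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    # The probability that one slot matches equals the Jaccard similarity,
    # so the fraction of matching slots is an unbiased estimate of it.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# A simple universal hash family h(x) = (a*crc32(x) + b) mod p.
# crc32 is used instead of Python's hash() so results are reproducible.
P = 2_147_483_647
rng = random.Random(0)
hash_funcs = [
    (lambda x, a=rng.randrange(1, P), b=rng.randrange(P):
        (a * zlib.crc32(x.encode()) + b) % P)
    for _ in range(256)
]

doc1 = {"spark", "mllib", "lsh", "similarity", "join"}
doc2 = {"spark", "mllib", "pipeline", "similarity", "join"}
true_jaccard = len(doc1 & doc2) / len(doc1 | doc2)  # 4/6, about 0.667
approx = estimate_jaccard(minhash_signature(doc1, hash_funcs),
                          minhash_signature(doc2, hash_funcs))
print(true_jaccard, round(approx, 2))  # estimate should land near 0.667
```

On the Spark side, one common fix is to drop the offending rows before the join, e.g. keeping only rows where the vector's `numNonzeros()` is positive.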
0 votes, 0 answers

How do I write a custom MLlib class that works with Databricks' autologging feature?

I'm trying to write a custom version of CrossValidator that evaluates (and averages over the folds) a list of metrics and, as usual, only picks the best of a particular chosen one (e.g. the first in the list). But in doing so, I'd like it…
Mateus • 13
0 votes, 0 answers

How do I write custom MLflow-friendly MLlib classes?

I'm trying to write a few custom classes to work with the existing MLlib codebase and MLflow on Databricks. For example, write a transformer or estimator, or extend an existing MLlib class, and be able to add it to a pipeline, fit it (if necessary), log…
Mateus • 13
0 votes, 0 answers

Spark ML throws exception for Decision Tree classification: Column features must be of type numeric but was actually of type struct

I am trying to create a Spark ML model with the Decision Tree Classifier to perform classification, but I am getting an error saying the features in my training set should be of type numeric instead of type struct. Here is the minimal reproducible…
0 votes, 0 answers

numpy_input IndexError: index 1 is out of bounds for axis 2 with size 1

Here's my code, modified from this GitHub repo, in which I'm trying to classify a set of MRI scan images as cancer / not cancer (0-1). As you can see in the code below, I got an error after defining the…
0 votes, 0 answers

Batching large input file into MLlib model

Is there any way to batch a large input file (111 MB) of 22 million cells (222 rows by 110k columns) in MLlib, similar to this Keras batching tutorial? The file contains the actual features extracted from 222…
0 votes, 0 answers

Converting CoordinateMatrix to Pyspark Dataframe

How can I convert a CoordinateMatrix to a PySpark DataFrame? I have tried converting my dataframe to a RowMatrix and then back to a dataframe using df.toRowMatrix().rows.map(lambda x: (x, )).toDF(), but the result looks really weird. …
Johnas • 296
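A CoordinateMatrix is essentially a distributed bag of MatrixEntry(i, j, value) triples, so a cleaner route to a DataFrame goes through its entries rather than toRowMatrix() — a RowMatrix carries no row indices, which is why the result above looks scrambled. The grouping step, sketched in plain Python with no Spark dependency:

```python
from collections import defaultdict

def entries_to_rows(entries, n_cols):
    """Group (row, col, value) triples into dense per-row lists,
    filling positions not mentioned in any triple with 0.0."""
    rows = defaultdict(lambda: [0.0] * n_cols)
    for i, j, v in entries:
        rows[i][j] = v
    # Sort by row index so the output order is deterministic.
    return [(i, rows[i]) for i in sorted(rows)]

# Triples as a CoordinateMatrix would hold them
entries = [(0, 0, 1.0), (0, 2, 3.0), (1, 1, 5.0)]
print(entries_to_rows(entries, 3))
# [(0, [1.0, 0.0, 3.0]), (1, [0.0, 5.0, 0.0])]
```

In PySpark the same idea would start from `mat.entries` (an RDD of MatrixEntry), e.g. `mat.entries.map(lambda e: (e.i, e.j, e.value)).toDF(["i", "j", "value"])` for a long-format DataFrame; exact pivoting from there depends on the shape you want.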
0 votes, 1 answer

TypeError: Cannot convert type into Vector

I have a dataframe with multiple rows that look like this: df.head() gives: Row(features=DenseVector([1.02, 4.23, 4.534, 0.342])) Now I want to compute the columnSimilarities() on my dataframe, and I do the following: rdd2 = df.rdd mat =…
Johnas • 296
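columnSimilarities() lives on pyspark.mllib.linalg.distributed.RowMatrix and expects mllib-style vectors, while df.rdd here yields rows of pyspark.ml DenseVectors — a likely source of the TypeError; assuming Spark ≥ 2.0, converting via pyspark.mllib.linalg.Vectors.fromML first is the usual workaround. What the method computes — cosine similarity between every pair of columns — can be checked in plain Python on a small matrix:

```python
import math

def column_cosine_similarities(rows):
    """Cosine similarity for every pair of columns of a row-major matrix.
    Assumes no column is entirely zero (its norm would be zero)."""
    n_cols = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(n_cols)]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Upper triangle only (i < j), matching columnSimilarities' output.
    return {(i, j): cos(cols[i], cols[j])
            for i in range(n_cols) for j in range(i + 1, n_cols)}

rows = [[1.0, 2.0, 0.0],
        [0.0, 1.0, 1.0],
        [2.0, 4.0, 0.0]]
sims = column_cosine_similarities(rows)
print(round(sims[(0, 1)], 4))  # -> 0.9759
```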
0 votes, 1 answer

PySpark to PMML - Failed to build PMML file

Currently, I am working on a simple machine learning program that generates a PMML. For this experiment, I use PySpark as the machine learning library and pyspark2pmml as the PMML builder. I have a problem when I want to build the PMML file. All the process…
furanzup • 91
0 votes, 0 answers

Keras/Elephas on PySpark: Could not serialize object: TypeError: can't pickle weakref objects

Trying to train a model on PySpark using Elephas, but I keep getting the following error when fitting the model/estimator. I am using PySpark ML transformers to transform the data from raw form to vectorized form. Trying to use Keras and Elephas to…
0 votes, 0 answers

Is it possible to utilise pyspark-ml methods with a window partitionBy approach?

Essentially, I have a dataset with thousands of distinct machines (with unique IDs) and variables measuring their operation on a daily basis, as in:

ID | Var1 | Var2
A  | 99   | 51
A  | 76   | 49
B  | 40   | 8
B  | 33   | 10

My objective is to use pyspark.ml…
0 votes, 1 answer

Do variables transformed through VectorIndexer get treated as categorical or numeric in XGBoost?

Let's say I have a string variable and I transform it using VectorIndexer. Now when I train an XGBoost model using this variable, will it be treated as numeric or categorical? Basically, I wanted to…
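For context, VectorIndexer decides per feature by counting distinct values: a feature with at most maxCategories distinct values (default 20) is treated as categorical and re-encoded as indices, otherwise it is left numeric. Whether XGBoost then honors that categorical metadata depends on the specific XGBoost–Spark integration; many simply see the indexed doubles as plain numerics. The decision rule can be mimicked in plain Python (function name and data hypothetical):

```python
def index_features(rows, max_categories=20):
    """Mimic VectorIndexer's rule: a column with at most max_categories
    distinct values is treated as categorical and its values replaced by
    indices 0..k-1 (in sorted order); otherwise it is left as-is."""
    n_cols = len(rows[0])
    out = [list(r) for r in rows]
    category_maps = {}
    for j in range(n_cols):
        distinct = sorted({r[j] for r in rows})
        if len(distinct) <= max_categories:
            mapping = {v: i for i, v in enumerate(distinct)}
            category_maps[j] = mapping
            for r in out:
                r[j] = float(mapping[r[j]])
    return out, category_maps

rows = [[10.0, 0.1], [20.0, 0.2], [10.0, 0.3], [30.0, 0.4]]
indexed, maps = index_features(rows, max_categories=3)
print(indexed)  # column 0 (3 distinct values) becomes indices; column 1 stays
# [[0.0, 0.1], [1.0, 0.2], [0.0, 0.3], [2.0, 0.4]]
```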
0 votes, 0 answers

Problem with PySpark MLlib linear regression when returning p-values

I am using PySpark MLlib to fit a linear regression model without regularization. Here is what I am using:

def fit_linear_regression(data_frame, weights):
    # elasticNetParam=1 and regParam > 0.0 enforces a lasso regularization
    lr =…
armin • 591
0 votes, 0 answers

How to calculate the R-squared value for linear regression in PySpark?

df = spark.read.csv('test.csv', header=True, inferSchema=True)
[trainingDF, testingDF] = df.randomSplit([0.8, 0.2])
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.feature import…
priston • 47
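With pyspark.ml's LinearRegression the fitted model exposes `model.summary.r2`, and `RegressionEvaluator(metricName="r2")` computes the same metric from a predictions DataFrame. The quantity itself is R² = 1 − SS_res / SS_tot, which is easy to verify in plain Python (the sample numbers below are made up):

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the fraction of variance in y_true
    explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)       # total variance
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residuals
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
print(round(r_squared(y_true, y_pred), 4))  # -> 0.995
```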
0 votes, 0 answers

Saving NLP vectorization models in MLFlow Databricks

I am quite new to MLflow. I was using a hashing TF-IDF vectorizer and a logistic regression model (from pyspark.ml) for a basic NLP problem. I am using MLflow to track the model training and to log the model. I need to use this model in…