Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
27
votes
6 answers

Serialize a custom transformer using Python to be used within a PySpark ML pipeline

I found the same discussion in comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to that: https://issues.apache.org/jira/browse/SPARK-17025. Given that there…
27
votes
1 answer

Encode and assemble multiple features in PySpark

I have a Python class that I'm using to load and process some data in Spark. Among other things, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to…
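The dummy-variable step this question describes is what PySpark's encoder/assembler stages do; conceptually it is just one-hot expansion followed by concatenation into a single feature vector. A dependency-free sketch with hypothetical helper names (not the Spark API):

```python
# One-hot encode a categorical value, then assemble pieces into one flat
# feature vector - the conceptual job of OneHotEncoder + VectorAssembler.

def one_hot(value, categories):
    """Return a dummy-variable list with a 1.0 at the value's position."""
    vec = [0.0] * len(categories)
    vec[categories.index(value)] = 1.0
    return vec

def assemble(*parts):
    """Concatenate scalar and list feature pieces into one flat vector."""
    out = []
    for p in parts:
        out.extend(p if isinstance(p, list) else [p])
    return out

categories = ["red", "green", "blue"]
row = {"color": "green", "age": 41.0}
features = assemble(one_hot(row["color"], categories), row["age"])
print(features)  # [0.0, 1.0, 0.0, 41.0]
```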
25
votes
4 answers

What is the difference between HashingTF and CountVectorizer in Spark?

Trying to do document classification in Spark. I am not sure what the hashing does in HashingTF; does it sacrifice any accuracy? I doubt it, but I don't know. The Spark docs say it uses the "hashing trick"... just another example of really bad/confusing…
Kai
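A dependency-free sketch of the tradeoff this question is about: HashingTF maps each token straight to a bucket via a hash function (no fitted vocabulary needed, but distinct tokens can collide into the same bucket), while CountVectorizer counts against an exact, fitted vocabulary. The helper names below are made up for illustration, not the Spark API.

```python
# Minimal illustration of the "hashing trick" behind HashingTF, versus the
# exact counting CountVectorizer does. Hypothetical helpers, not Spark's API.

def hashing_tf(tokens, num_features=16):
    """Map each token to a bucket via hash; colliding tokens merge counts."""
    vec = [0] * num_features
    for t in tokens:
        vec[hash(t) % num_features] += 1
    return vec

def count_vectorize(tokens, vocab):
    """Exact counts against a fitted vocabulary (no collisions possible)."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab.index(t)] += 1
    return vec

tokens = ["spark", "ml", "spark"]
vocab = ["ml", "spark"]
print(count_vectorize(tokens, vocab))  # [1, 2]
print(sum(hashing_tf(tokens)))         # 3 - totals survive even if buckets collide
```

So hashing can cost a little accuracy only when unrelated tokens collide; shrinking `num_features` raises that risk, which is the knob the question is really asking about.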
24
votes
2 answers

How to create a custom Estimator in PySpark

I am trying to build a simple custom Estimator in PySpark MLlib. I have read that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does and why I need so many…
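On the @keyword_only part of this question: the sketch below is a simplified, hypothetical re-implementation of what PySpark's decorator does, not the library code itself. It forces keyword-only calls and stashes the passed kwargs on the instance as `_input_kwargs`, which is why custom `__init__`/`setParams` pairs in PySpark examples can forward arguments to each other.

```python
import functools

# Simplified stand-in for pyspark.keyword_only: reject positional arguments
# and record the keyword arguments so other methods can reuse them.

def keyword_only(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s only takes keyword arguments." % func.__name__)
        self._input_kwargs = kwargs   # what setParams later picks up
        return func(self, **kwargs)
    return wrapper

class MyEstimator:
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        self.inputCol = inputCol
        self.outputCol = outputCol

est = MyEstimator(inputCol="features", outputCol="prediction")
print(est._input_kwargs)  # {'inputCol': 'features', 'outputCol': 'prediction'}
```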
23
votes
2 answers

Save ML model for future usage

I was applying some machine learning algorithms like linear regression, logistic regression, and naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark…
Alberto Bonsanto
23
votes
3 answers

Sparse Vector vs Dense Vector

How do I create SparseVector and dense Vector representations if the DenseVector is denseV = np.array([0., 3., 0., 4.])? What will the sparse vector representation be?
Anoop Toffy
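A plain-Python sketch of the mapping this question asks about: a sparse vector stores only the size, the indices of the non-zeros, and the non-zero values. For the dense vector in the question, the Spark form would be `SparseVector(4, [1, 3], [3.0, 4.0])`; the converters below are illustrative helpers, not the `pyspark.ml.linalg` API.

```python
# Dense <-> sparse round trip: sparse form is (size, non-zero indices, values).

def to_sparse(dense):
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

dense = [0.0, 3.0, 0.0, 4.0]
size, idx, vals = to_sparse(dense)
print(size, idx, vals)  # 4 [1, 3] [3.0, 4.0]
```

The sparse form only pays off when most entries are zero; for the tiny example above the dense list is actually smaller.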
22
votes
1 answer

How to evaluate a classifier with PySpark 2.4.5

I'm wondering what the best way is to evaluate a fitted binary classification model using Apache Spark 2.4.5 and PySpark (Python). I want to consider different metrics such as accuracy, precision, recall, auc and f1 score. Let us assume that the…
Jannik
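The metrics this question lists (except AUC, which needs scores rather than hard predictions) all fall out of the four confusion-matrix counts. A plain-Python sketch of those standard formulas follows; in PySpark itself one would reach for the built-in evaluators instead, this is only the arithmetic behind them.

```python
# Standard binary-classification metrics from (label, prediction) pairs.

def binary_metrics(pairs):
    """pairs: list of (label, prediction) with 0/1 values."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = binary_metrics([(1, 1), (0, 0), (1, 0), (0, 1), (1, 1)])
print(m["accuracy"])  # 0.6
```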
22
votes
1 answer

Matrix Multiplication in Apache Spark

I am trying to perform matrix multiplication using Apache Spark and Java. I have 2 main questions: How to create RDD that can represent matrix in Apache Spark? How to multiply two such RDDs?
Jigar
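The usual RDD representation for this is coordinate form: an entry per non-zero cell as (row, col, value). The sketch below shows the join-and-reduce logic of distributed multiplication with the RDD simulated by a plain list; in Spark itself the distributed-matrix classes (e.g. BlockMatrix) wrap this up for you.

```python
from collections import defaultdict

# Multiply sparse matrices given as [(i, j, value)] coordinate lists:
# join A(i, j, a) with B(j, k, b) on j, emit partial products keyed by (i, k),
# then reduce by key - the same shape as an RDD join + reduceByKey.

def coord_multiply(a_entries, b_entries):
    b_by_row = defaultdict(list)
    for j, k, b in b_entries:
        b_by_row[j].append((k, b))
    out = defaultdict(float)
    for i, j, a in a_entries:
        for k, b in b_by_row[j]:
            out[(i, k)] += a * b
    return dict(out)

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]  # [[1, 2], [3, 0]]
B = [(0, 0, 4.0), (1, 0, 5.0)]               # [[4], [5]]
print(coord_multiply(A, B))  # {(0, 0): 14.0, (1, 0): 12.0}
```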
22
votes
2 answers

How to cross validate RandomForest model?

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
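Spark does ship a cross-validation utility (CrossValidator in the ML tuning package); if one did have to do it manually, the core of it is just index bookkeeping. A plain-Python sketch of k-fold splitting, with made-up helper names:

```python
import random

# Produce k (train, test) index splits: shuffle once, deal indices into k
# folds round-robin, then hold out one fold at a time.

def k_fold_indices(n, k, seed=42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
print(len(splits))  # 5 folds, each holding out 2 of the 10 rows
```

Each row lands in exactly one test fold, so averaging the per-fold metric gives the cross-validated estimate.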
21
votes
4 answers

Spark train test split

I am curious if there is something similar to sklearn's http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html for apache-spark in the latest 2.0.1 release. So far I could only find…
Georg Heiler
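The idea behind sklearn's StratifiedShuffleSplit is simple enough to sketch without Spark: group rows by label, then split each group with the same test fraction so class proportions survive the split. In PySpark, `DataFrame.sampleBy` gives a similar per-label sample. The helper below is illustrative, not any library's API.

```python
import random
from collections import defaultdict

# Stratified train/test split: split each label group separately so both
# sides keep (roughly) the original class balance.

def stratified_split(rows, label_of, test_frac=0.25, seed=0):
    by_label = defaultdict(list)
    for r in rows:
        by_label[label_of(r)].append(r)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

rows = [(i, i % 2) for i in range(20)]  # 10 rows per label
train, test = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(test))  # 16 4 - with 2 rows of each label in the test set
```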
20
votes
2 answers

Gaussian Mixture Models: Difference between Spark MLlib and scikit-learn

I'm trying to use Gaussian mixture models on a sample of a dataset. I used both MLlib (with PySpark) and scikit-learn and got very different results, with the scikit-learn one looking more realistic. from pyspark.mllib.clustering import GaussianMixture…
ixaxaar
19
votes
2 answers

Extracting a numpy array from a PySpark DataFrame

I have a dataframe gi_man_df where group can be n: +------------------+-----------------+--------+--------------+ | group | number|rand_int| rand_double| +------------------+-----------------+--------+--------------+ | …
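For a DataFrame small enough to collect to the driver, the usual pattern this question leads to is collect-then-convert: `df.collect()` returns Row objects that behave like tuples, which numpy can stack directly. The collected rows below are a stand-in, since no Spark session is available here.

```python
import numpy as np

# Stand-in for df.select("rand_double", "rand_int").collect() on a small frame:
collected = [(0.92, 3.0), (0.17, 7.0), (0.55, 1.0)]

arr = np.array(collected)   # rows become a 2-D float array
print(arr.shape)  # (3, 2)
```

Note this pulls every row to the driver, so it is only appropriate for data that fits in one machine's memory.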
18
votes
3 answers

Creating Spark dataframe from numpy matrix

It is my first time with PySpark (Spark 2), and I'm trying to create a toy DataFrame for a logit model. I ran the tutorial successfully and would like to pass my own data into it. I've tried this: %pyspark import numpy as np from pyspark.ml.linalg…
Jan Sila
18
votes
3 answers

How to prepare data into a LibSVM format from DataFrame?

I want to produce LibSVM format, so I reshaped my DataFrame into the desired layout, but I do not know how to convert it to LibSVM format. The format is as shown in the figure. The desired LibSVM entry is user item:rating. If you know what to do in the…
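The LibSVM text format itself is just `label index:value index:value ...` with 1-based, ascending indices. A plain-Python formatter for one row follows; reading the user/item/rating columns into this shape is an assumption taken from the question, not something the excerpt specifies.

```python
# Format one LibSVM line: "label index:value ..." with ascending 1-based indices.

def to_libsvm(label, features):
    """features: list of (index, value) pairs, indices 1-based."""
    parts = ["%g" % label]
    for idx, val in sorted(features):
        parts.append("%d:%g" % (idx, val))
    return " ".join(parts)

# e.g. user rating item 1 with 2.0 and item 3 with 4.5:
line = to_libsvm(1, [(3, 4.5), (1, 2.0)])
print(line)  # 1 1:2 3:4.5
```

For DataFrames, Spark can also write this directly via the `libsvm` data source once the features are in a single vector column.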
18
votes
3 answers

Apache Spark: StackOverflowError when trying to index string columns

I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame: val data = sqlContext.read .format(csvFormat) .option("header", "true") .option("inferSchema", "true") .load(file) .cache() After that I search all string…
Andrew Tsibin