Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
27
votes
6 answers

Serialize a custom transformer using Python to be used within a PySpark ML pipeline

I found the same discussion in comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to that: https://issues.apache.org/jira/browse/SPARK-17025. Given that there…
27
votes
1 answer

Encode and assemble multiple features in PySpark

I have a Python class that I'm using to load and process some data in Spark. Among other things, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to…
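The dummy-variable step this question describes is what PySpark's encoder/assembler stages do; conceptually it is just one-hot expansion followed by concatenation into a single feature vector. A dependency-free sketch with hypothetical helper names (not the Spark API):

```python
# One-hot encode a categorical value, then assemble pieces into one flat
# feature vector - the conceptual job of OneHotEncoder + VectorAssembler.

def one_hot(value, categories):
    """Return a dummy-variable list with a 1.0 at the value's position."""
    vec = [0.0] * len(categories)
    vec[categories.index(value)] = 1.0
    return vec

def assemble(*parts):
    """Concatenate scalar and list feature pieces into one flat vector."""
    out = []
    for p in parts:
        out.extend(p if isinstance(p, list) else [p])
    return out

categories = ["red", "green", "blue"]
row = {"color": "green", "age": 41.0}
features = assemble(one_hot(row["color"], categories), row["age"])
print(features)  # [0.0, 1.0, 0.0, 41.0]
```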
25
votes
4 answers

What is the difference between HashingTF and CountVectorizer in Spark?

Trying to do document classification in Spark. I am not sure what the hashing does in HashingTF; does it sacrifice any accuracy? I doubt it, but I don't know. The Spark docs say it uses the "hashing trick"... just another example of really bad/confusing…
Kai
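A dependency-free sketch of the tradeoff this question is about: HashingTF maps each token straight to a bucket via a hash function (no fitted vocabulary needed, but distinct tokens can collide into the same bucket), while CountVectorizer counts against an exact, fitted vocabulary. The helper names below are made up for illustration, not the Spark API.

```python
# Minimal illustration of the "hashing trick" behind HashingTF, versus the
# exact counting CountVectorizer does. Hypothetical helpers, not Spark's API.

def hashing_tf(tokens, num_features=16):
    """Map each token to a bucket via hash; colliding tokens merge counts."""
    vec = [0] * num_features
    for t in tokens:
        vec[hash(t) % num_features] += 1
    return vec

def count_vectorize(tokens, vocab):
    """Exact counts against a fitted vocabulary (no collisions possible)."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab.index(t)] += 1
    return vec

tokens = ["spark", "ml", "spark"]
vocab = ["ml", "spark"]
print(count_vectorize(tokens, vocab))  # [1, 2]
print(sum(hashing_tf(tokens)))         # 3 - totals survive even if buckets collide
```

So hashing can cost a little accuracy only when unrelated tokens collide; shrinking `num_features` raises that risk, which is the knob the question is really asking about.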
24
votes
2 answers

How to create a custom Estimator in PySpark

I am trying to build a simple custom Estimator in PySpark MLlib. I have read that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does and why I need so many…
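On the @keyword_only part of this question: the sketch below is a simplified, hypothetical re-implementation of what PySpark's decorator does, not the library code itself. It forces keyword-only calls and stashes the passed kwargs on the instance as `_input_kwargs`, which is why custom `__init__`/`setParams` pairs in PySpark examples can forward arguments to each other.

```python
import functools

# Simplified stand-in for pyspark.keyword_only: reject positional arguments
# and record the keyword arguments so other methods can reuse them.

def keyword_only(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s only takes keyword arguments." % func.__name__)
        self._input_kwargs = kwargs   # what setParams later picks up
        return func(self, **kwargs)
    return wrapper

class MyEstimator:
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        self.inputCol = inputCol
        self.outputCol = outputCol

est = MyEstimator(inputCol="features", outputCol="prediction")
print(est._input_kwargs)  # {'inputCol': 'features', 'outputCol': 'prediction'}
```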
23
votes
2 answers

Save ML model for future usage

I was applying some machine learning algorithms like linear regression, logistic regression, and naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark…
Alberto Bonsanto
23
votes
3 answers

Sparse Vector vs Dense Vector

How do I create SparseVector and dense Vector representations if the DenseVector is denseV = np.array([0., 3., 0., 4.])? What will the sparse vector representation be?
Anoop Toffy
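A plain-Python sketch of the mapping this question asks about: a sparse vector stores only the size, the indices of the non-zeros, and the non-zero values. For the dense vector in the question, the Spark form would be `SparseVector(4, [1, 3], [3.0, 4.0])`; the converters below are illustrative helpers, not the `pyspark.ml.linalg` API.

```python
# Dense <-> sparse round trip: sparse form is (size, non-zero indices, values).

def to_sparse(dense):
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

dense = [0.0, 3.0, 0.0, 4.0]
size, idx, vals = to_sparse(dense)
print(size, idx, vals)  # 4 [1, 3] [3.0, 4.0]
```

The sparse form only pays off when most entries are zero; for the tiny example above the dense list is actually smaller.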
22
votes
1 answer

How to evaluate a classifier with PySpark 2.4.5

I'm wondering what the best way is to evaluate a fitted binary classification model using Apache Spark 2.4.5 and PySpark (Python). I want to consider different metrics such as accuracy, precision, recall, auc and f1 score. Let us assume that the…
Jannik
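The metrics this question lists (except AUC, which needs scores rather than hard predictions) all fall out of the four confusion-matrix counts. A plain-Python sketch of those standard formulas follows; in PySpark itself one would reach for the built-in evaluators instead, this is only the arithmetic behind them.

```python
# Standard binary-classification metrics from (label, prediction) pairs.

def binary_metrics(pairs):
    """pairs: list of (label, prediction) with 0/1 values."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = binary_metrics([(1, 1), (0, 0), (1, 0), (0, 1), (1, 1)])
print(m["accuracy"])  # 0.6
```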
22
votes
1 answer

Matrix Multiplication in Apache Spark

I am trying to perform matrix multiplication using Apache Spark and Java. I have 2 main questions: How to create RDD that can represent matrix in Apache Spark? How to multiply two such RDDs?
Jigar
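The usual RDD representation for this is coordinate form: an entry per non-zero cell as (row, col, value). The sketch below shows the join-and-reduce logic of distributed multiplication with the RDD simulated by a plain list; in Spark itself the distributed-matrix classes (e.g. BlockMatrix) wrap this up for you.

```python
from collections import defaultdict

# Multiply sparse matrices given as [(i, j, value)] coordinate lists:
# join A(i, j, a) with B(j, k, b) on j, emit partial products keyed by (i, k),
# then reduce by key - the same shape as an RDD join + reduceByKey.

def coord_multiply(a_entries, b_entries):
    b_by_row = defaultdict(list)
    for j, k, b in b_entries:
        b_by_row[j].append((k, b))
    out = defaultdict(float)
    for i, j, a in a_entries:
        for k, b in b_by_row[j]:
            out[(i, k)] += a * b
    return dict(out)

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]  # [[1, 2], [3, 0]]
B = [(0, 0, 4.0), (1, 0, 5.0)]               # [[4], [5]]
print(coord_multiply(A, B))  # {(0, 0): 14.0, (1, 0): 12.0}
```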
22
votes
2 answers

How to cross validate RandomForest model?

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
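Spark does ship a cross-validation utility (CrossValidator in the ML tuning package); if one did have to do it manually, the core of it is just index bookkeeping. A plain-Python sketch of k-fold splitting, with made-up helper names:

```python
import random

# Produce k (train, test) index splits: shuffle once, deal indices into k
# folds round-robin, then hold out one fold at a time.

def k_fold_indices(n, k, seed=42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
print(len(splits))  # 5 folds, each holding out 2 of the 10 rows
```

Each row lands in exactly one test fold, so averaging the per-fold metric gives the cross-validated estimate.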
21
votes
4 answers

Spark train test split

I am curious if there is something similar to sklearn's http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html for apache-spark in the latest 2.0.1 release. So far I could only find…
Georg Heiler
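The idea behind sklearn's StratifiedShuffleSplit is simple enough to sketch without Spark: group rows by label, then split each group with the same test fraction so class proportions survive the split. In PySpark, `DataFrame.sampleBy` gives a similar per-label sample. The helper below is illustrative, not any library's API.

```python
import random
from collections import defaultdict

# Stratified train/test split: split each label group separately so both
# sides keep (roughly) the original class balance.

def stratified_split(rows, label_of, test_frac=0.25, seed=0):
    by_label = defaultdict(list)
    for r in rows:
        by_label[label_of(r)].append(r)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

rows = [(i, i % 2) for i in range(20)]  # 10 rows per label
train, test = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(test))  # 16 4 - with 2 rows of each label in the test set
```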
20
votes
2 answers

Gaussian Mixture Models: Difference between Spark MLlib and scikit-learn

I'm trying to use Gaussian mixture models on a sample of a dataset. I used both MLlib (with PySpark) and scikit-learn and got very different results, with the scikit-learn one looking more realistic. from pyspark.mllib.clustering import GaussianMixture…
ixaxaar
19
votes
2 answers

Extracting a numpy array from a PySpark DataFrame

I have a dataframe gi_man_df where group can be n: +------------------+-----------------+--------+--------------+ | group | number|rand_int| rand_double| +------------------+-----------------+--------+--------------+ | …
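For a DataFrame small enough to collect to the driver, the usual pattern this question leads to is collect-then-convert: `df.collect()` returns Row objects that behave like tuples, which numpy can stack directly. The collected rows below are a stand-in, since no Spark session is available here.

```python
import numpy as np

# Stand-in for df.select("rand_double", "rand_int").collect() on a small frame:
collected = [(0.92, 3.0), (0.17, 7.0), (0.55, 1.0)]

arr = np.array(collected)   # rows become a 2-D float array
print(arr.shape)  # (3, 2)
```

Note this pulls every row to the driver, so it is only appropriate for data that fits in one machine's memory.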
18
votes
3 answers

Creating Spark dataframe from numpy matrix

It is my first time with PySpark (Spark 2), and I'm trying to create a toy DataFrame for a logit model. I ran the tutorial successfully and would like to pass my own data into it. I've tried this: %pyspark import numpy as np from pyspark.ml.linalg…
Jan Sila
18
votes
3 answers

How to prepare data into a LibSVM format from DataFrame?

I want to produce LibSVM format, so I reshaped my DataFrame into the desired layout, but I do not know how to convert it to LibSVM format. The format is as shown in the figure. The desired LibSVM entry is user item:rating. If you know what to do in the…
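The LibSVM text format itself is just `label index:value index:value ...` with 1-based, ascending indices. A plain-Python formatter for one row follows; reading the user/item/rating columns into this shape is an assumption taken from the question, not something the excerpt specifies.

```python
# Format one LibSVM line: "label index:value ..." with ascending 1-based indices.

def to_libsvm(label, features):
    """features: list of (index, value) pairs, indices 1-based."""
    parts = ["%g" % label]
    for idx, val in sorted(features):
        parts.append("%d:%g" % (idx, val))
    return " ".join(parts)

# e.g. user rating item 1 with 2.0 and item 3 with 4.5:
line = to_libsvm(1, [(3, 4.5), (1, 2.0)])
print(line)  # 1 1:2 3:4.5
```

For DataFrames, Spark can also write this directly via the `libsvm` data source once the features are in a single vector column.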
18
votes
3 answers

Apache Spark: StackOverflowError when trying to index string columns

I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame: val data = sqlContext.read .format(csvFormat) .option("header", "true") .option("inferSchema", "true") .load(file) .cache() After that I search all string…
Andrew Tsibin