Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
25
votes
4 answers

What is the difference between HashingTF and CountVectorizer in Spark?

Trying to do document classification in Spark. I am not sure what the hashing in HashingTF does; does it sacrifice any accuracy? I doubt it, but I don't know. The Spark docs say it uses the "hashing trick"... just another example of really bad/confusing…
Kai
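One way to see the trade-off is to sketch both transformers in plain Python (illustrative only, not Spark's implementation): HashingTF maps each term straight to a bucket by hashing, so distinct terms can collide, which is the (usually small) accuracy cost; CountVectorizer fits an explicit vocabulary first, so there are no collisions.

```python
def hashing_tf(tokens, num_features=16):
    """Term-frequency vector via the hashing trick; no fitted vocabulary."""
    vec = [0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1  # distinct terms may collide
    return vec

def count_vectorizer_fit(docs):
    """Build an explicit, collision-free vocabulary: one index per term."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def count_vectorizer_transform(tokens, vocab):
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

docs = [["spark", "ml", "spark"], ["ml", "pipeline"]]
vocab = count_vectorizer_fit(docs)
tf_hashed = hashing_tf(docs[0])
tf_counted = count_vectorizer_transform(docs[0], vocab)
```

In Spark ML the corresponding classes are `pyspark.ml.feature.HashingTF` (collision risk shrinks as `numFeatures` grows; the default is 2^18) and `pyspark.ml.feature.CountVectorizer` (exact, but requires a `fit` pass and memory for the vocabulary).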
25
votes
4 answers

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows: pca = PCA(k=3, inputCol="features", outputCol="pca_features") model = pca.fit(data) where data is a Spark DataFrame with one…
nanounanue
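Since Spark 2.0, `PCAModel` exposes the eigenvectors as `model.pc` and the explained-variance proportions as `model.explainedVariance`. The same quantities can be illustrated with plain numpy (a sketch of what PCA computes, not Spark's code):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
data[:, 1] += 3 * data[:, 0]             # introduce correlation

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition of covariance

order = np.argsort(eigvals)[::-1]        # sort components by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 3
components = eigvecs[:, :k]              # ~ model.pc (one column per component)
explained = eigvals[:k] / eigvals.sum()  # ~ model.explainedVariance
```

On older Spark versions without `explainedVariance`, the same ratios can be recovered by projecting the data and comparing per-component variances to the total.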
24
votes
2 answers

How to create a custom Estimator in PySpark

I am trying to build a simple custom Estimator in PySpark MLlib. I have seen that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does and why I need so many…
24
votes
1 answer

Create feature vector programmatically in Spark ML / pyspark

I'm wondering if there is a concise way to run ML (e.g KMeans) on a DataFrame in pyspark if I have the features in multiple numeric columns. I.e. as in the Iris dataset: (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa',…
zoltanctoth
23
votes
2 answers

Save ML model for future usage

I was applying some machine learning algorithms (Linear Regression, Logistic Regression, and Naive Bayes) to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark…
Alberto Bonsanto
22
votes
5 answers

Spark, ML, StringIndexer: handling unseen labels

My goal is to build a multiclass classifier. I have built a pipeline for feature extraction, and its first step is a StringIndexer transformer that maps each class name to a label; this label will be used in the classifier training step. The…
Rami
22
votes
2 answers

How to cross validate RandomForest model?

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
20
votes
1 answer

How to merge multiple feature vectors in DataFrame?

Using Spark ML transformers I arrived at a DataFrame where each row looks like this: Row(object_id, text_features_vector, color_features, type_features) where text_features is a sparse vector of term weights, color_features is a small 20-element…
Felipe
19
votes
3 answers

What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

After I trained a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. Then, when I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know…
Hereme
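For binary logistic regression the relationship between the three columns can be sketched in plain Python (illustrative, not Spark's code): `rawPrediction` holds the class margins `[-m, m]` where `m = w·x + b`, `probability` is the logistic transform of the margin, and `prediction` is the argmax of `probability`.

```python
import math

def raw_to_probability(raw):
    """Map rawPrediction [-m, m] to [P(class 0), P(class 1)]."""
    margin = raw[1]
    p1 = 1.0 / (1.0 + math.exp(-margin))   # logistic / sigmoid
    return [1.0 - p1, p1]

def probability_to_prediction(prob):
    """prediction is just the index of the largest probability."""
    return float(max(range(len(prob)), key=lambda i: prob[i]))

raw = [-2.0, 2.0]                 # the model leans toward class 1
prob = raw_to_probability(raw)
pred = probability_to_prediction(prob)
```

For tree ensembles `rawPrediction` is instead a vote/impurity tally, but the same pattern holds: raw scores, then a normalized probability vector, then its argmax.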
19
votes
2 answers

Apache Spark throws NullPointerException when encountering missing feature

I have a bizarre issue with PySpark when indexing a column of strings in features. Here is my tmp.csv file: x0,x1,x2,x3 asd2s,1e1e,1.1,0 asd2s,1e1e,0.1,0 ,1e3e,1.2,0 bd34t,1e1e,5.1,1 asd2s,1e3e,0.2,0 bd34t,1e2e,4.3,1 where I have one missing value…
serge_k
18
votes
3 answers

pyspark extract ROC curve?

Is there a way to get the points on an ROC curve from Spark ML in pyspark? In the documentation I see an example for Scala but not python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html Is that right? I can certainly think of ways…
seth127
18
votes
3 answers

How to prepare data into a LibSVM format from DataFrame?

I want to produce LibSVM-format data, so I arranged my DataFrame into the desired format, but I do not know how to convert it to LibSVM format. The format is as shown in the figure. The desired LibSVM form is user item:rating. If you know what to do in the…
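The LibSVM text format is `label index:value index:value ...` with 1-based, ascending indices and zeros omitted. A small pure-Python formatter for rows of the user/item:rating shape described above (illustrative; the dict layout is an assumption):

```python
def to_libsvm_line(label, features):
    """features: {1-based index -> value}; zero entries are omitted."""
    parts = [str(label)]
    for idx in sorted(features):          # LibSVM requires ascending indices
        if features[idx] != 0:
            parts.append(f"{idx}:{features[idx]}")
    return " ".join(parts)

lines = [to_libsvm_line(1, {3: 4.5, 1: 2.0}),
         to_libsvm_line(0, {2: 3.0})]
```

If the DataFrame already has `label`/`features` columns, Spark can also write the format directly via the built-in data source: `df.write.format("libsvm").save(path)` (Spark 2.0+).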
18
votes
1 answer

Caching intermediate results in Spark ML pipeline

Lately I've been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out quite handy, with a streamlined API for chaining up algorithm stages and hyper-parameter grid search. Still, I found its support for one…
zaxliu
17
votes
2 answers

KMeans clustering in PySpark

I have a Spark DataFrame 'mydataframe' with many columns. I am trying to run kmeans on only two columns, lat and long (latitude and longitude), using them as simple values. I want to extract 7 clusters based on just those 2 columns, and then I want to…
17
votes
1 answer

SparkException: Values to assemble cannot be null

I want to use StandardScaler to normalize the features. Here is my code: val Array(trainingData, testData) = dataset.randomSplit(Array(0.7,0.3)) val vectorAssembler = new…
April