Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
25
votes
4 answers

What is the difference between HashingTF and CountVectorizer in Spark?

Trying to do document classification in Spark. I am not sure what the hashing in HashingTF does; does it sacrifice any accuracy? I doubt it, but I don't know. The Spark docs say it uses the "hashing trick"... just another example of really bad/confusing…
Kai
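One way to see the trade-off is to sketch both transformers in plain Python (illustrative only, not Spark's implementation): HashingTF maps each term straight to a bucket by hashing, so distinct terms can collide, which is the (usually small) accuracy cost; CountVectorizer fits an explicit vocabulary first, so there are no collisions.

```python
def hashing_tf(tokens, num_features=16):
    """Term-frequency vector via the hashing trick; no fitted vocabulary."""
    vec = [0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1  # distinct terms may collide
    return vec

def count_vectorizer_fit(docs):
    """Build an explicit, collision-free vocabulary: one index per term."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def count_vectorizer_transform(tokens, vocab):
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

docs = [["spark", "ml", "spark"], ["ml", "pipeline"]]
vocab = count_vectorizer_fit(docs)
tf_hashed = hashing_tf(docs[0])
tf_counted = count_vectorizer_transform(docs[0], vocab)
```

In Spark ML the corresponding classes are `pyspark.ml.feature.HashingTF` (collision risk shrinks as `numFeatures` grows; the default is 2^18) and `pyspark.ml.feature.CountVectorizer` (exact, but requires a `fit` pass and memory for the vocabulary).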
25
votes
4 answers

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows: pca = PCA(k=3, inputCol="features", outputCol="pca_features") model = pca.fit(data) where data is a Spark DataFrame with one…
nanounanue
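Since Spark 2.0, `PCAModel` exposes the eigenvectors as `model.pc` and the explained-variance proportions as `model.explainedVariance`. The same quantities can be illustrated with plain numpy (a sketch of what PCA computes, not Spark's code):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
data[:, 1] += 3 * data[:, 0]             # introduce correlation

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition of covariance

order = np.argsort(eigvals)[::-1]        # sort components by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 3
components = eigvecs[:, :k]              # ~ model.pc (one column per component)
explained = eigvals[:k] / eigvals.sum()  # ~ model.explainedVariance
```

On older Spark versions without `explainedVariance`, the same ratios can be recovered by projecting the data and comparing per-component variances to the total.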
24
votes
2 answers

How to create a custom Estimator in PySpark

I am trying to build a simple custom Estimator in PySpark MLlib. I have seen that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does and why I need so many…
24
votes
1 answer

Create feature vector programmatically in Spark ML / pyspark

I'm wondering if there is a concise way to run ML (e.g KMeans) on a DataFrame in pyspark if I have the features in multiple numeric columns. I.e. as in the Iris dataset: (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa',…
zoltanctoth
23
votes
2 answers

Save ML model for future usage

I was applying some machine learning algorithms (Linear Regression, Logistic Regression, and Naive Bayes) to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark…
Alberto Bonsanto
22
votes
5 answers

Spark, ML, StringIndexer: handling unseen labels

My goal is to build a multiclass classifier. I have built a pipeline for feature extraction, and its first step is a StringIndexer transformer that maps each class name to a label; this label will be used in the classifier training step. The…
Rami
22
votes
2 answers

How to cross validate RandomForest model?

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
20
votes
1 answer

How to merge multiple feature vectors in DataFrame?

Using Spark ML transformers I arrived at a DataFrame where each row looks like this: Row(object_id, text_features_vector, color_features, type_features) where text_features is a sparse vector of term weights, color_features is a small 20-element…
Felipe
19
votes
3 answers

What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

After I trained a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. Then, when I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know…
Hereme
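For binary logistic regression the relationship between the three columns can be sketched in plain Python (illustrative, not Spark's code): `rawPrediction` holds the class margins `[-m, m]` where `m = w·x + b`, `probability` is the logistic transform of the margin, and `prediction` is the argmax of `probability`.

```python
import math

def raw_to_probability(raw):
    """Map rawPrediction [-m, m] to [P(class 0), P(class 1)]."""
    margin = raw[1]
    p1 = 1.0 / (1.0 + math.exp(-margin))   # logistic / sigmoid
    return [1.0 - p1, p1]

def probability_to_prediction(prob):
    """prediction is just the index of the largest probability."""
    return float(max(range(len(prob)), key=lambda i: prob[i]))

raw = [-2.0, 2.0]                 # the model leans toward class 1
prob = raw_to_probability(raw)
pred = probability_to_prediction(prob)
```

For tree ensembles `rawPrediction` is instead a vote/impurity tally, but the same pattern holds: raw scores, then a normalized probability vector, then its argmax.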
19
votes
2 answers

Apache Spark throws NullPointerException when encountering missing feature

I have a bizarre issue with PySpark when indexing a column of strings in features. Here is my tmp.csv file: x0,x1,x2,x3 asd2s,1e1e,1.1,0 asd2s,1e1e,0.1,0 ,1e3e,1.2,0 bd34t,1e1e,5.1,1 asd2s,1e3e,0.2,0 bd34t,1e2e,4.3,1 where I have one missing value…
serge_k
18
votes
3 answers

pyspark extract ROC curve?

Is there a way to get the points on an ROC curve from Spark ML in pyspark? In the documentation I see an example for Scala but not python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html Is that right? I can certainly think of ways…
seth127
18
votes
3 answers

How to prepare data into a LibSVM format from DataFrame?

I want to produce LibSVM-format data, so I arranged my DataFrame into the desired format, but I do not know how to convert it to LibSVM format. The format is as shown in the figure. The desired LibSVM form is user item:rating. If you know what to do in the…
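The LibSVM text format is `label index:value index:value ...` with 1-based, ascending indices and zeros omitted. A small pure-Python formatter for rows of the user/item:rating shape described above (illustrative; the dict layout is an assumption):

```python
def to_libsvm_line(label, features):
    """features: {1-based index -> value}; zero entries are omitted."""
    parts = [str(label)]
    for idx in sorted(features):          # LibSVM requires ascending indices
        if features[idx] != 0:
            parts.append(f"{idx}:{features[idx]}")
    return " ".join(parts)

lines = [to_libsvm_line(1, {3: 4.5, 1: 2.0}),
         to_libsvm_line(0, {2: 3.0})]
```

If the DataFrame already has `label`/`features` columns, Spark can also write the format directly via the built-in data source: `df.write.format("libsvm").save(path)` (Spark 2.0+).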
18
votes
1 answer

Caching intermediate results in Spark ML pipeline

Lately I've been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out quite handy, with a streamlined API for chaining up algorithm stages and hyper-parameter grid search. Still, I found its support for one…
zaxliu
17
votes
2 answers

KMeans clustering in PySpark

I have a Spark DataFrame 'mydataframe' with many columns. I am trying to run kmeans on only two columns, lat and long (latitude and longitude), using them as simple values. I want to extract 7 clusters based on just those 2 columns, and then I want to…
17
votes
1 answer

SparkException: Values to assemble cannot be null

I want to use StandardScaler to normalize the features. Here is my code: val Array(trainingData, testData) = dataset.randomSplit(Array(0.7,0.3)) val vectorAssembler = new…
April