
I am trying to perform PCA in a Spark application using the PySpark API in a Python script. I am doing it this way:

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
PCAmodel = pca.fit(data)

When I run those two lines of code in the PySpark shell, they work and return good results, but in the application script I get this type of error:

PCA() got an unexpected keyword argument 'k'

PS: In both cases I am using Spark 2.2.0.

Where is the problem? Why does it work in the PySpark shell but not in the application?


3 Answers


You probably imported from ml in one case:

from pyspark.ml.feature import PCA

and mllib in the other:

from pyspark.mllib.feature import PCA
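
For comparison, a minimal sketch of how the two classes differ (the aliases here are mine, added just to show both side by side):

from pyspark.ml.feature import PCA as PCA_ml
from pyspark.mllib.feature import PCA as PCA_mllib

# ml: DataFrame-based API, configured entirely through keyword parameters
pca_df = PCA_ml(k=3, inputCol="features", outputCol="pcaFeatures")

# mllib: RDD-based API; the constructor takes only k,
# and fit() expects an RDD of vectors instead of a DataFrame
pca_rdd = PCA_mllib(3)
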
  • No, this is not my case; in both cases I'm using `from pyspark.ml.feature import PCA` –  Nov 15 '17 at 16:04

Are you sure you have not also imported PCA from scikit-learn, after you imported it from PySpark in your application script?

spark.version
# u'2.2.0'

from pyspark.ml.feature import PCA
from sklearn.decomposition import PCA  # shadows the PySpark PCA imported above

# PySpark syntax with the scikit-learn PCA class
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
# TypeError: __init__() got an unexpected keyword argument 'k'

Reversing the order of the imports will not produce the error (not shown), since then the name `PCA` refers to the PySpark class.
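
One quick way to confirm this kind of shadowing in your script is to inspect what the name `PCA` is actually bound to just before the call (a diagnostic sketch; the exact module string depends on your scikit-learn version):

from pyspark.ml.feature import PCA
from sklearn.decomposition import PCA

# The last import wins: the name PCA now refers to the scikit-learn class
print(PCA.__module__)  # e.g. 'sklearn.decomposition.pca', not 'pyspark.ml.feature'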

desertnaut

Try renaming the classes as you import them:

from pyspark.ml.feature import PCA as PCAML
from sklearn.decomposition import PCA as PCASK

pca_ml = PCAML(k=3, inputCol="features", outputCol="pcaFeatures")

There should then be no confusion about which one you are calling.
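
For completeness, a minimal sketch of how each aliased class would then be used; `data` is the DataFrame from the question, while `X` (a NumPy array) is an assumed input for the scikit-learn side:

from pyspark.ml.feature import PCA as PCAML
from sklearn.decomposition import PCA as PCASK

# PySpark: fit on a DataFrame with a vector column named "features"
pca_ml = PCAML(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca_ml.fit(data)
reduced = model.transform(data)

# scikit-learn: fit on a local 2-D array (X is an assumed input)
pca_sk = PCASK(n_components=3)
X_reduced = pca_sk.fit_transform(X)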

MisterJT