I need to apply PCA to the data in a CSV file. I found an example in an answer to another question on this site and tried to follow it, but it raises an error. Could someone help me? This is my code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA as PCAml

spark = SparkSession.Builder().appName("read_csv").master("local[*]").getOrCreate()

if __name__ == "__main__":
    dfX = spark.read.format("csv").option("inferSchema", True).option("header", True).load("inmune_X.csv")
    dfX = dfX.drop('_c0')
    assembler = VectorAssembler(inputCols=dfX.columns, outputCol="atributos")
    dd = assembler.transform(dfX)
    dd.select("atributos").show(5)
    pca = PCAml(k=4, inputCol="atributos", outputCol="pca")
    model = pca.fit(assembler)
    transformed = model.transform(assembler)
    transformed.show(5)
This is the error message:
  File "D:\UGR\Investigación\Cosas de Reinaldo\mis script\Seleccion_caracteristicas\pca_rowmatrix_example.py", line 65, in <module>
    model = pca.fit(assembler)
  File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\ml\base.py", line 129, in fit
    return self._fit(dataset)
  File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\ml\wrapper.py", line 321, in _fit
    java_model = self._fit_java(dataset)
  File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\ml\wrapper.py", line 318, in _fit_java
    return self._java_obj.fit(dataset._jdf)
AttributeError: 'VectorAssembler' object has no attribute '_jdf'