
I'm using PySpark to find clusters in a dataset. So far I have used the following code:

# Vectorize the two feature columns into a single "features" vector
from pyspark.ml.feature import VectorAssembler

assembler1 = VectorAssembler(
    inputCols=["cons_fraud_prob", "merch_fraud_prob"],
    outputCol="features")

transformed_model = assembler1.transform(model_data)

from pyspark.ml.clustering import GaussianMixture

# Fit a 3-component Gaussian mixture with a fixed seed
gmm = GaussianMixture().setK(3).setSeed(14)
model = gmm.fit(transformed_model)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)

What I want to do now is add a column giving the cluster that each data point belongs to. How can I implement this?
