I'm using PySpark to find clusters in a dataset. This is the code I have so far:
# Assemble the two fraud-probability columns into a single feature vector
from pyspark.ml.feature import VectorAssembler

assembler1 = VectorAssembler(
    inputCols=["cons_fraud_prob", "merch_fraud_prob"],
    outputCol="features")
transformed_model = assembler1.transform(model_data)

# Fit a Gaussian mixture model with 3 components on the "features" column
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture().setK(3).setSeed(14)
model = gmm.fit(transformed_model)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)
What I want to do now is add a column showing which cluster each data point belongs to. How can I implement this?
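For reference, here is a minimal sketch of what I'm imagining, assuming the fitted model's transform method can append a per-row cluster assignment (the column name "prediction" and the variable assigned_df are my guesses, not something I've confirmed):

# Sketch (assumption): pass the assembled DataFrame back through the fitted
# model so every row gets a cluster label alongside its features
assigned_df = model.transform(transformed_model)

# Hoping for an integer column (e.g. "prediction") holding the cluster index
assigned_df.select("cons_fraud_prob", "merch_fraud_prob", "prediction").show(5)

Is something like this the right way to get the cluster for each data point, or is there a more idiomatic approach?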