
Is there a way to implement dimensionality reduction using PySpark? I have a dataframe loaded into PySpark:

from pyspark.sql import SparkSession

FILENAME = "test.csv"
spark = SparkSession.builder.appName('Test')  \
    .getOrCreate()

spark_df = spark.read.csv(FILENAME, header=True)
# Select the embedding columns (everything after the first five columns)
embedded_df_columns = spark_df.columns[5:]
embedded_df = spark_df.select(embedded_df_columns)

I can't seem to find the right class in pyspark.ml.feature for t-SNE; all I can find is PCA. Can anyone help, please?
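For what it's worth, here is roughly how far I can get with the PCA route (a minimal sketch; the cast to double and the VectorAssembler step are my assumptions, since spark.read.csv with header=True loads every column as a string):

from pyspark.ml.feature import PCA, VectorAssembler
from pyspark.sql.functions import col

# spark.read.csv loads every column as string, so cast the embeddings to double
numeric_df = embedded_df.select(*[col(c).cast("double") for c in embedded_df_columns])

# pyspark.ml's PCA expects a single vector column as input
assembler = VectorAssembler(inputCols=embedded_df_columns, outputCol="features")
features_df = assembler.transform(numeric_df)

# Fit PCA and project each row onto k=2 principal components
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(features_df)
reduced_df = model.transform(features_df).select("pca_features")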

  • If pyspark is not a hard requirement, it may be worthwhile trying cuml, which has a t-SNE implementation (see the sketch after these comments). It's optimized for GPU and, depending on your dataset, compute resources, and use case, could offer exactly what you need. – Yang Wu Jan 17 '23 at 12:45
  • In my experience, I've found myself memory-constrained when trying to scale t-SNE. Whether the original poster wants Spark because they are CPU-limited or memory-limited determines whether GPU acceleration is what they need. I tend to find that with larger and larger datasets I need to raise the perplexity to get comparably good results, so the memory requirement scales worse than linearly with data size. I came across this thread precisely because I'm searching for a Spark implementation or some other solution to my memory problem. – gazza89 Mar 30 '23 at 12:48
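As a rough illustration of the cuML route mentioned in the first comment, something like the following might work (a minimal sketch, assuming cuml is installed and the embeddings fit in driver and single-GPU memory; the collect-and-cast step is an assumption about the string columns produced by spark.read.csv):

import numpy as np
from cuml.manifold import TSNE

# Collect the embeddings to the driver as a float array
# (assumes the whole dataset fits in host/GPU memory)
X = np.array(
    [[float(v) for v in row] for row in embedded_df.collect()],
    dtype=np.float32,
)

# cuML's t-SNE produces a 2-D embedding
tsne = TSNE(n_components=2, perplexity=30.0)
X_2d = tsne.fit_transform(X)

Note that this sidesteps Spark entirely for the t-SNE step, so it only helps if the bottleneck is compute rather than fitting the data in memory.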

0 Answers