
I have a huge dataset coming from multiple files (100M rows, 10K columns). Except for the first column, all columns are floats, and each column of the input corresponds to a sample that needs to be clustered. Unfortunately, this means I need to transpose the DataFrame. To make matters worse, the pivoting requires a groupBy, which, as far as I can see, will lead to spurious data. A small sample of the model system is shown below:

import org.apache.spark.sql.types._

val columns = Seq("Name", "X1", "X2", "X3", "X4")
val data = Seq(("id1", "1", "2", "3", "4"),("id2", "2", "2", "1", "8"),("id3", "1", "2", "5", "8"))

val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd).toDF(columns:_*)
df.show()

// Cast every column except "Name" from string to float
for (col <- df.columns.drop(1)) {
  df = df.withColumn(col, df(col).cast(FloatType))
}
df.show()

+----+---+---+---+---+
|Name| X1| X2| X3| X4|
+----+---+---+---+---+
| id1|1.0|2.0|3.0|4.0|
| id2|2.0|2.0|1.0|8.0|
| id3|1.0|2.0|5.0|8.0|
+----+---+---+---+---+
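
For reference, the groupBy/pivot route I mentioned looks roughly like this on the toy data (only a sketch; with 10K sample columns the stack expression would have to be generated programmatically):

import org.apache.spark.sql.functions.first

// Unpivot into long format (one row per (id, sample, value) triple),
// then pivot the ids back out as columns so each sample becomes a row.
val long = df.selectExpr(
  "Name",
  "stack(4, 'X1', X1, 'X2', X2, 'X3', X3, 'X4', X4) as (Sample, Value)")

val transposed = long
  .groupBy("Sample")
  .pivot("Name")
  .agg(first("Value"))
  .withColumnRenamed("Sample", "Name")

transposed.orderBy("Name").show()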

This is my first attempt with MLlib, and from what I could find, KMeans requires the samples to be given as rows, as shown below, but I did not find any easy way to get the data into that shape.

+----+----+----+----+
|Name| id1| id2| id3|
+----+----+----+----+
|  X1| 1.0| 2.0| 1.0|
|  X2| 2.0| 2.0| 2.0|
|  X3| 3.0| 1.0| 5.0|
|  X4| 4.0| 8.0| 8.0|
+----+----+----+----+
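
To make the target concrete, this is roughly how I would then feed such a layout to the DataFrame-based KMeans (a sketch only, assuming a DataFrame named transposed with the layout above, e.g. the result of the pivot sketch earlier, and the toy ids as feature columns):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the per-sample feature columns (one per original row id)
// into the single vector column that spark.ml KMeans expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("id1", "id2", "id3"))
  .setOutputCol("features")
val assembled = assembler.transform(transposed)

val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(assembled)
model.transform(assembled).select("Name", "prediction").show()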

I also considered switching to PySpark so I could transpose with NumPy arrays (through df.to_numpy()), but that incurs a huge penalty just in loading the data.

I believe there must be a better way of doing this (maybe a library such as Breeze), and I would appreciate guidance on how to approach it.

PS: I looked at DenseMatrix, and it seems to require collecting all the data onto a single node. Given the size, that did not look like a viable solution.
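
For illustration, even on the toy data the DenseMatrix route means pulling everything onto the driver first (sketch only, using the spark.ml local DenseMatrix):

import org.apache.spark.ml.linalg.DenseMatrix

// collect() brings every row to the driver, which is exactly the problem
// at 100M rows x 10K columns.
val featureCols = df.columns.drop(1)
val localRows = df.collect()
val values = featureCols.flatMap(c => localRows.map(_.getAs[Float](c).toDouble))
val local = new DenseMatrix(localRows.length, featureCols.length, values) // column-major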
