I have huge data coming from multiple files (100M rows, 10K columns). Except for the first column, all values are floats, and each column of the input corresponds to a sample that needs to be clustered. Unfortunately, this means I need to transpose the dataframe. To make matters worse, pivoting requires a groupBy, which as far as I can see will lead to spurious data. A small sample of the model system is shown below:
import org.apache.spark.sql.types._

val columns = Seq("Name", "X1", "X2", "X3", "X4")
val data = Seq(("id1", "1", "2", "3", "4"), ("id2", "2", "2", "1", "8"), ("id3", "1", "2", "5", "8"))
val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd).toDF(columns: _*)
df.show()

// Cast every column except "Name" from String to Float
for (c <- df.columns.drop(1)) {
  df = df.withColumn(c, df(c).cast(FloatType))
}
df.show()
+----+---+---+---+---+
|Name| X1| X2| X3| X4|
+----+---+---+---+---+
| id1|1.0|2.0|3.0|4.0|
| id2|2.0|2.0|1.0|8.0|
| id3|1.0|2.0|5.0|8.0|
+----+---+---+---+---+
This is my first attempt with MLlib, and from what I could find, KMeans requires the samples to be given as rows (one feature vector per row), as shown below, but that does not seem possible here. I did not find any easy way to do it.
+----+----+----+----+
|Name| id1| id2| id3|
+----+----+----+----+
| X1| 1.0| 2.0| 1.0|
| X2| 2.0| 2.0| 2.0|
| X3| 2.0| 1.0| 5.0|
| X4| 1.0| 8.0| 8.0|
+----+----+----+----+
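For reference, this is roughly the groupBy-based pivot I was referring to above. It is only a sketch on the toy data (the feature/value column names are just illustrative), and it is exactly the approach I am not confident will behave well at 100M rows x 10K columns:

import org.apache.spark.sql.functions._

// Unpivot the value columns into (Name, feature, value) rows with stack,
// then pivot on Name to get one row per original column.
val valueCols = df.columns.drop(1)
val stackExpr = valueCols
  .map(c => s"'$c', $c")
  .mkString(s"stack(${valueCols.length}, ", ", ", ") as (feature, value)")

val transposed = df
  .selectExpr("Name", stackExpr)
  .groupBy("feature")
  .pivot("Name")
  .agg(first("value"))

transposed.show()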
What I also considered was to switch to PySpark instead, so that I could transpose with NumPy arrays (through df.toPandas() followed by to_numpy()). But that incurs a huge penalty just in loading the data.
I believe there must be a better way of doing this (maybe with a library such as Breeze), and I would appreciate guidance on solving this problem.
PS: I looked at DenseMatrix, and it seems to require collecting all the data onto a single node. Given the size, that did not look like a good solution.
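For completeness, this is roughly what I mean by the DenseMatrix route (again only a sketch on the toy data, using the mllib DenseMatrix constructor):

import org.apache.spark.mllib.linalg.DenseMatrix

// Everything has to be collect()ed to the driver first,
// which is what rules this out at 100M rows x 10K columns.
val featureCols = df.columns.drop(1)
val localRows = df.collect()   // pulls the whole dataset onto a single node
val values = localRows.flatMap(r => featureCols.map(c => r.getAs[Float](c).toDouble))
// values are laid out row by row, so flag the matrix as row-major (isTransposed = true)
val mat = new DenseMatrix(localRows.length, featureCols.length, values, true)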