
I have performed a Principal Component Analysis on a matrix I previously loaded with sc.textFile. The output is an org.apache.spark.mllib.linalg.Matrix, which I then converted to an RDD[Vector[Double]].

With:

    import java.io.PrintWriter

I did:

    val pw = new PrintWriter("Matrix.csv")
    rows3.collect().foreach(line => pw.println(line))
    pw.flush

The output CSV is promising. The only problem is that each line is a DenseVector(some values). How do I split each line into the corresponding coefficients?

Thanks a lot

fricadelle
  • Take a look [here](http://stackoverflow.com/questions/29946190/how-to-change-rowmatrix-into-array-in-spark-or-export-it-as-a-csv/29946713#29946713)! – eliasah Jul 06 '15 at 15:47

2 Answers


You can use the result of computePrincipalComponents together with breeze.linalg.csvwrite:

    import java.io.File
    import breeze.linalg.{DenseMatrix => BDM, csvwrite}

    val mat: RowMatrix = ...
    val pca = mat.computePrincipalComponents(...)

    // pca is a local Matrix, so its dimensions and values can be copied
    // straight into a Breeze DenseMatrix and written out as CSV
    csvwrite(
      new File("Matrix.csv"),
      new BDM[Double](pca.numRows, pca.numCols, pca.toArray))
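
To sanity-check the written file you could read it back with Breeze's csvread (a quick sketch; it assumes the file name used above and only compares dimensions):

    import breeze.linalg.csvread

    // read the CSV back into a local DenseMatrix and check its shape
    val check = csvread(new File("Matrix.csv"))
    println((check.rows, check.cols))
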
zero323
  • except that when I check the output file not all values are displayed; it is basically the same as the console output, for example the first line: "value1 value2 ... (523 total)" – fricadelle Jul 07 '15 at 10:50

Convert each vector to a string (you can do it either on the driver or on the executors):

    val pw = new PrintWriter("Matrix.csv")
    rows3.map(_.mkString(",")).collect().foreach(line => pw.println(line))
    pw.flush
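
If rows3 actually holds mllib vectors rather than Scala Vector[Double] (the DenseVector(...) lines in the question suggest it might), mkString is not available on them directly; an assumed variant would go through toArray first:

    // assumption: rows3 is an RDD[org.apache.spark.mllib.linalg.Vector]
    rows3.map(_.toArray.mkString(",")).collect().foreach(pw.println)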

Edit: if your data is too big to fit in the memory of the driver, you can try something like this:

    val rdd = rows3.map(_.mkString(",")).zipWithIndex.cache
    val total = rdd.count
    val step = 10000 // rows in each chunk
    // chunk boundaries; appending total covers the last, possibly partial, chunk
    val range = (0L to total by step) :+ total
    val limits = range.zip(range.drop(1))
    limits.foreach { case (start, end) =>
      rdd.filter(x => x._2 >= start && x._2 < end)
         .map(_._1)
         .collect
         .foreach(pw.println(_))
    }

I can't try this out, but that is the general idea.
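
Alternatively, as the comments below point out, you can avoid collecting to the driver at all and let Spark write the output itself. A sketch (the Matrix_csv path is just an example; the result is a directory of part files rather than a single CSV):

    rows3.map(_.mkString(",")).saveAsTextFile("Matrix_csv")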

lev
  • This is not how it should be done. You shouldn't forget that you are trying to write within the boundaries of a distributed system. You might run into concurrency issues if this doesn't overload the JVM first! – eliasah Jul 06 '15 at 15:45
  • I know that it will work, but the whole purpose of Spark as a distributed system is its scalability. Plus you might run into some concurrency issues considering the non-atomic transformation over mutable data which is the output. – eliasah Jul 06 '15 at 16:12
  • There are no concurrency issues. `collect()` returns a local array and nothing mutable happens before that. And if you know the output data is small (but the intermediate data is large) this is a perfectly reasonable way to work with Spark. – lmm Jul 06 '15 at 16:15
  • Well, `computePrincipalComponents` returns `org.apache.spark.mllib.linalg.Matrix` and it is already a local data structure. There is really no reason to convert it to a RDD in the first place. – zero323 Jul 06 '15 at 17:48