I have a large data file (around 4 GB) and I am analyzing it with Spark on a single PC.

scala> x
res29: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@5a86096a

scala> x.numRows
res27: Long = 302529

scala> x.numCols
res28: Long = 1828

When I try to compute the principal components I get a memory error:

scala> val pc: Matrix = x.computePrincipalComponents(2)

     15/03/30 14:55:22 INFO ContextCleaner: Cleaned shuffle 1
    java.lang.OutOfMemoryError: Java heap space
        at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:92)
        at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
        at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:38)
        at breeze.generic.UFunc$class.apply(UFunc.scala:48)
        at breeze.linalg.svd$.apply(svd.scala:22)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:380)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)

How can I solve that?


1 Answer

If you have more RAM available than Spark is currently using, you can try increasing the driver's Java heap size with the command-line option --driver-memory 8g (assuming local mode here, in which the computation is done by the driver program). The default is only 512m.
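For example, restarting the shell with a larger driver heap might look like this (a sketch; the 8g figure assumes your machine has that much RAM to spare, and x is the same RowMatrix as in the question, rebuilt after the restart):

    $ spark-shell --driver-memory 8g

    scala> import org.apache.spark.mllib.linalg.Matrix
    scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix

    scala> // ... rebuild x as before, then retry the PCA:
    scala> val pc: Matrix = x.computePrincipalComponents(2)

The same option works with spark-submit if you run the analysis as a packaged application instead of from the shell.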
