
I'm trying to convert a DataFrame to a Breeze dense matrix using Scala. I couldn't find any built-in functions to do this, so here's what I'm doing.

import scala.util.Random
import breeze.linalg.DenseMatrix
import spark.implicits._ // spark is the active SparkSession; needed for .toDF

val featuresDF = (1 to 10)
    .map(_ => (Random.nextDouble, Random.nextDouble, Random.nextDouble))
    .toDF("F1", "F2", "F3")

var FeatureArray: Array[Array[Double]] = Array.empty
val features = featuresDF.columns

// Collect each column to the driver and append it as one row of the array.
for (i <- features.indices) {
    FeatureArray = FeatureArray :+ featuresDF.select(features(i)).collect.map(_.getDouble(0))
}

val denseMat = DenseMatrix(FeatureArray: _*).t

This works fine and I get what I want. However, it causes OOM exceptions in my environment. Is there a better way of doing this conversion? My ultimate goal is to calculate the eigenvalues and eigenvectors of the features using the dense matrix.

import breeze.stats.covmat
import breeze.linalg.eig

val covariance = covmat(denseMat)
val eigen = eig(covariance)

So it would be even better if there were a direct way to get the eigenvalues and eigenvectors from the DataFrame. PCA in Spark ML must be doing this calculation using the features column. Is there a way to access the eigenvalues through PCA?
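For context, the closest thing I've found to a direct route is mllib's distributed RowMatrix; a rough sketch of what I mean (assuming my featuresDF above with three Double columns):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import breeze.linalg.{eig, DenseMatrix => BDM}

// Build a distributed row matrix straight from the DataFrame's rows.
val rowRdd = featuresDF.rdd.map(r =>
  Vectors.dense(r.getDouble(0), r.getDouble(1), r.getDouble(2)))
val cov = new RowMatrix(rowRdd).computeCovariance()

// Only the small 3 x 3 covariance matrix reaches the driver;
// mllib matrices are column-major, which matches Breeze's layout.
val eigen = eig(new BDM(cov.numRows, cov.numCols, cov.toArray))
```

This avoids ever collecting the full data matrix to the driver, but I'm not sure it's what Spark's PCA does internally.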


2 Answers


First of all, try increasing the amount of RAM available to your driver.

Secondly, try one of these functions, which build Spark's own DenseMatrix. Both use the same amount of RAM on my machine.

Parsing 201238 rows of a DataFrame with a single column, where each cell contains several Double values, took 1.34 seconds:

import org.apache.spark.mllib.linalg.DenseMatrix
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.DataFrame

def getDenseMatrixFromDF(featuresDF: DataFrame): DenseMatrix = {
    val featuresTrain = featuresDF.columns
    val rows = featuresDF.count().toInt

    // Collect each column, unpack the ml DenseVector cells, and flatten
    // everything into a single column-major array.
    val newFeatureArray: Array[Double] = featuresTrain.indices
      .flatMap(i => featuresDF.select(featuresTrain(i)).collect())
      .flatMap(_.toSeq)
      .flatMap(_.asInstanceOf[DenseVector].values)
      .toArray

    val newCols = newFeatureArray.length / rows
    new DenseMatrix(rows, newCols, newFeatureArray, isTransposed = false)
}
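If the goal is the asker's covmat/eig pipeline, the returned mllib matrix can be handed to Breeze without copying element by element; a minimal sketch (assuming the rows of the matrix are the observations, as covmat expects):

```scala
import breeze.linalg.{DenseMatrix => BDM, eigSym}
import breeze.stats.covmat

// mllib DenseMatrix stores values column-major, the same layout Breeze
// uses by default, so the backing array can be reused directly.
val sparkMat = getDenseMatrixFromDF(featuresDF)
val breezeMat = new BDM(sparkMat.numRows, sparkMat.numCols, sparkMat.values)

// The covariance matrix is symmetric, so eigSym is a natural fit here.
val eigen = eigSym(covmat(breezeMat))
```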

If instead I want a DenseVector from a DataFrame with one column containing a single Double per row, the same amount of data took 0.8 seconds:

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.DataFrame

def getDenseVectorFromDF(featuresDF: DataFrame): DenseVector = {
    val featuresTrain = featuresDF.columns
    val cols = featuresTrain.length

    cols match {
      case i if i > 1 => throw new IllegalArgumentException
      case _ =>
        // Collect the single column and unbox each cell to Double.
        val newFeatureArray: Array[Double] = featuresTrain.indices
          .flatMap(i => featuresDF.select(featuresTrain(i)).collect())
          .flatMap(_.toSeq)
          .map(_.asInstanceOf[Double])
          .toArray

        new DenseVector(newFeatureArray)
    }
}

To compute eigenvalues/eigenvectors, check this link and this API link.

To compute the covariance matrix, check this link and this API link.
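As a sketch of the eigen route those links point at, I'm assuming RowMatrix.computePrincipalComponentsAndExplainedVariance here: the principal components are the eigenvectors of the covariance matrix, and the explained-variance vector is the eigenvalues normalised to proportions of the total variance.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One mllib Vector per DataFrame row, assuming all columns are Double.
val mat = new RowMatrix(featuresDF.rdd.map(r =>
  Vectors.dense(r.toSeq.map(_.asInstanceOf[Double]).toArray)))

// pc's columns are the eigenvectors of the covariance matrix; the second
// element gives each eigenvalue's share of the total variance.
val (pc, explainedVariance) =
  mat.computePrincipalComponentsAndExplainedVariance(mat.numCols().toInt)
```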

import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.sql.DataFrame

def getDenseMatrixFromDF(featuresDF: DataFrame): BDM[Double] = {
    val cols = featuresDF.columns.length
    val rows = featuresDF.count().toInt
    // Collect once up front; calling collect() inside tabulate would
    // re-run the Spark job for every single cell.
    val collected = featuresDF.collect()
    BDM.tabulate(rows, cols)((i, j) => collected(i).getAs[Double](j))
}