
I need to add two matrices that are stored in two files.

The contents of latest1.txt and latest2.txt are as follows:

1 2 3
4 5 6
7 8 9

I am reading those files as follows:

scala> val rows = sc.textFile("latest1.txt").map { line =>
         val values = line.split(' ').map(_.toDouble)
         Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
       }

scala> val r1 = rows
r1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:14

scala> val rows = sc.textFile("latest2.txt").map { line =>
         val values = line.split(' ').map(_.toDouble)
         Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
       }

scala> val r2 = rows
r2: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:14

I want to add r1 and r2. Is there any way to add these two RDD[mllib.linalg.Vector]s in Apache Spark?

  • Zip the two RDDs together, then map over the resulting RDD – The Archetypal Paul Jan 30 '15 at 10:25
  • yeah, I did that: val rdd3 = rdd1.zip(rdd2) and then val rdd4 = rdd3.map { e => e._1 + e._2 }, and I am getting <console>:22: error: type mismatch; found: org.apache.spark.mllib.linalg.Vector, required: String. There is no + or add operation on mllib Vectors; the addition operation is defined on util.Vector. – krishna Jan 30 '15 at 10:27
  • Looks like + isn't the operator to add two Vectors, so you're getting the default implicit that tries to convert to String. – The Archetypal Paul Jan 30 '15 at 10:31
  • yeah, but I couldn't find any function or operator that performs addition. – krishna Jan 30 '15 at 10:41

2 Answers


This is actually a good question. I work with mllib regularly and did not realize these basic linear algebra operations are not easily accessible.

The point is that the underlying breeze vectors support all of the linear algebra manipulations you would expect, including, of course, the basic element-wise addition you specifically mentioned.

However, the breeze implementation is hidden from the outside world via:

private[mllib]

So then, from the outside world/public API perspective, how do we access those primitives?

Some of them are already exposed, e.g. the squared distance:

/**
 * Returns the squared distance between two Vectors.
 * @param v1 first Vector.
 * @param v2 second Vector.
 * @return squared distance between two Vectors.
 */
def sqdist(v1: Vector, v2: Vector): Double = { 
  ...
}

However, the set of such exposed methods is limited, and it does not include the basic operations: element-wise addition, subtraction, multiplication, etc.

So here is the best approach I could find:

  • Convert the vectors to breeze
  • Perform the vector operations in breeze
  • Convert back from breeze to an mllib Vector

Here is some sample code:

import org.apache.spark.mllib.linalg.Vectors
import breeze.linalg.DenseVector  // breeze's DenseVector, not mllib's

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)

val vectout = Vectors.dense((bv1 + bv2).toArray)
vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]
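The same conversion pattern answers the original RDD question: zip the two row RDDs together, add each pair of rows in breeze, and convert back. A minimal sketch, assuming r1 and r2 are the RDD[Vector]s built in the question (note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which holds here since the two files have identical shape):

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Pair up corresponding rows of the two matrices, add each pair
// element-wise in breeze, then convert back to mllib Vectors.
val summed = r1.zip(r2).map { case (a, b) =>
  Vectors.dense((BDV(a.toArray) + BDV(b.toArray)).toArray)
}

summed.collect().foreach(println)  // prints the element-wise row sums
```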
  • Yes. `MLlib` is not a complete linear algebra library, `Breeze` should be used if such operations are needed. – Shyamendra Solanki Feb 02 '15 at 11:00
  • 2
    But what if the vector is sparse. I am currently manipulating the sparse vector. But if using your way to convert the vector, it will cost much more memory and slow down the calculating speed. It is weird that pyspark can do this operation easily. So I am thinking using python instead. – hidemyname Sep 08 '15 at 13:03
  • This is actually what I am trying.. [but what am I doing wrong here](http://stackoverflow.com/questions/36581220/why-can-i-only-retrieve-arrayfloat-word-vectors-but-have-to-pass-mllib-linalg)? – Stefan Falk Apr 12 '16 at 18:52
  • @displayname I answered that question. – WestCoastProjects Apr 12 '16 at 22:57
  • @HahaTTpro did you `import org.apache.spark.ml.linalg.DenseVector` ? – WestCoastProjects Nov 08 '17 at 19:29
  • 1
    @javadba How much do you think the performance will be affected when dealing with Sparse vectors? I'm dealing with Spark vectors of length `2**20` and I can't seem to find an efficient way to deal with this in Scala. – Sai Kiriti Badam Feb 27 '18 at 09:31
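Regarding the sparse-vector concern in the comments: breeze also has a SparseVector, and its addition preserves sparsity, so the round trip need not densify anything. A sketch under that assumption (toBreezeSparse and addSparse are names invented here, not part of either API):

```scala
import breeze.linalg.{SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

// Convert an mllib SparseVector to breeze without materializing zeros.
def toBreezeSparse(v: SparseVector): BSV[Double] =
  new BSV(v.indices, v.values, v.size)

// Add two sparse vectors of equal size, keeping the result sparse.
def addSparse(a: SparseVector, b: SparseVector): Vector = {
  val sum: BV[Double] = toBreezeSparse(a) + toBreezeSparse(b)
  Vectors.sparse(a.size, sum.activeIterator.toSeq)  // only non-zero entries
}
```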

The following code exposes asBreeze and fromBreeze methods from Spark. This solution supports SparseVector, in contrast to using vector.toArray. Note that Spark may change this API in the future, and has already renamed toBreeze to asBreeze.

package org.apache.spark.mllib.linalg
import breeze.linalg.{Vector => BV}
import org.apache.spark.sql.functions.udf

/** expose vector.toBreeze and Vectors.fromBreeze
  */
object VectorUtils {

  def fromBreeze(breezeVector: BV[Double]): Vector = {
    Vectors.fromBreeze( breezeVector )
  }

  def asBreeze(vector: Vector): BV[Double] = {
    // this is vector.asBreeze in Spark 2.0
    vector.toBreeze
  }

  val addVectors = udf {
    (v1: Vector, v2: Vector) => fromBreeze( asBreeze(v1) + asBreeze(v2) )
  }

}

With this you can do df.withColumn("xy", addVectors($"x", $"y")).
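The same helpers also work at the RDD level for the original question, not just on DataFrame columns. A hedged sketch, assuming the VectorUtils object above has been compiled into a JAR on the classpath and r1/r2 are the RDD[Vector]s from the question:

```scala
import org.apache.spark.mllib.linalg.{Vector, VectorUtils}

// Element-wise sum of corresponding rows via the exposed breeze round trip.
val sums = r1.zip(r2).map { case (a, b) =>
  VectorUtils.fromBreeze(VectorUtils.asBreeze(a) + VectorUtils.asBreeze(b))
}
```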

  • Shouldn't the first line be `import org.apache.spark.mllib.linalg._` instead of a package declaration? If I use it as is, I get an error saying "illegal start of definition". – Scott H Jul 09 '18 at 20:23
  • @scottH no, because the functions need to be part of the package to get access to private members. The code worked fine in Spark 1.6.1, but Spark 2+ has changed things. Did you try compiling the code into a JAR instead of copy-pasting it into spark-shell? – Jussi Kujala Jul 26 '18 at 20:37