
I am writing a project for Spark 1.4 in Scala and am currently deciding whether to convert my initial input data into spark.mllib.linalg.Vector or scala.immutable.Vector, which I later want to work with in my algorithm. Could someone briefly explain the difference between the two and in which situations one would be more useful than the other?

Thank you.

Sasha

1 Answer


spark.mllib.linalg.Vector is designed for linear algebra applications. mllib provides two different implementations, DenseVector and SparseVector. While you have access to useful methods like norm or sqdist, it is rather limited otherwise.
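A minimal sketch of the two implementations (assuming Spark 1.4's mllib is on the classpath; the literal values are only illustrative):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector: every value is stored explicitly.
val dense: Vector = Vectors.dense(1.0, 2.0, 3.0)

// Sparse vector: overall size plus (index, value) pairs for the non-zeros.
val sparse: Vector = Vectors.sparse(3, Seq((0, 1.0)))

// Squared Euclidean distance between the two:
// (1-1)^2 + (2-0)^2 + (3-0)^2 = 13.0
val d: Double = Vectors.sqdist(dense, sparse)
```

Note that both are local data structures; no SparkContext is needed to create or compare them.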

Like all data structures from org.apache.spark.mllib.linalg, it can store only 64-bit floating-point numbers (scala.Double).

If you plan to use mllib, then spark.mllib.linalg.Vector is pretty much your only option. All the remaining data structures from mllib, both local and distributed, are built on top of org.apache.spark.mllib.linalg.Vector.

Otherwise, scala.immutable.Vector is probably a much better choice. It is a general purpose, dense data structure.

It can store objects of any type, so you can have Vector[String] for example.
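For example (plain Scala, no extra dependencies):

```scala
// scala.immutable.Vector is generic: the element type is a type parameter.
val words: Vector[String] = Vector("spark", "scala", "mllib")

// Standard collection methods work on any element type.
val lengths: Vector[Int] = words.map(_.length)  // Vector(5, 5, 5)
```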

Since it is Traversable, you have access to all expected methods like map, flatMap, reduce, fold, filter, etc.
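A few of those methods in action (the values are only illustrative):

```scala
val xs: Vector[Int] = Vector(1, 2, 3, 4, 5)

val doubled = xs.map(_ * 2)                    // Vector(2, 4, 6, 8, 10)
val evens   = xs.filter(_ % 2 == 0)            // Vector(2, 4)
val total   = xs.fold(0)(_ + _)                // 15
val mirrored = xs.flatMap(x => Vector(x, -x))  // Vector(1, -1, 2, -2, ...)
```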

Edit: If you need algebraic operations and don't use any of the data structures from org.apache.spark.mllib.linalg.distributed, you may prefer breeze.linalg.Vector over spark.mllib.linalg.Vector. It supports a larger set of algebraic methods, including the dot product, and provides a typical collection API.
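A rough sketch of what that looks like (assuming Breeze is on the classpath; element-wise operators and dot are part of Breeze's vector API):

```scala
import breeze.linalg.DenseVector

val a = DenseVector(1.0, 2.0, 3.0)
val b = DenseVector(4.0, 5.0, 6.0)

val dot    = a dot b   // 1*4 + 2*5 + 3*6 = 32.0
val sum    = a + b     // element-wise addition: DenseVector(5.0, 7.0, 9.0)
val scaled = a * 2.0   // scalar multiplication: DenseVector(2.0, 4.0, 6.0)
```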

zero323