
I have a DataFrame `doubleSeq` whose structure is as below:

res274: org.apache.spark.sql.DataFrame = [finalFeatures: vector]

The first record of the column is as follows

res281: org.apache.spark.sql.Row = [[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]]

I want to extract the double array

[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]

from this row. Running

doubleSeq.head(1)(0)(0)

gives

Any = [3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]

Which does not solve my problem.

The question "Scala Spark - split vector column into separate columns in a Spark DataFrame" does not solve my issue, but it's a useful pointer.


1 Answer


So you want to extract a Vector from a Row, and turn it into an array of doubles.

The problem with your code is that the get method (and the implicit apply method you are using) returns an object of type Any. Indeed, a Row is a generic, unparametrized container, and there is no way to know at compile time what types it holds. It's a bit like Lists in Java 1.4 and before. To solve this in Spark, you can use the getAs method, which you can parametrize with a type of your choosing.
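For contrast, a minimal sketch of the two access styles, assuming the doubleSeq DataFrame from the question (the vector type is identified just below):

import org.apache.spark.ml.linalg.Vector

val row = doubleSeq.head              // the first Row
val v1 = row(0).asInstanceOf[Vector]  // apply returns Any, so an explicit cast is needed
val v2 = row.getAs[Vector](0)         // getAs carries the type parameter for you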

In your situation, you seem to have a DataFrame whose column contains vectors of type org.apache.spark.ml.linalg.Vector.

import org.apache.spark.ml.linalg._
val firstRow = df.head(1)(0) // or simply df.head
val vect: Vector = firstRow.getAs[Vector](0)
// or all in one: df.head.getAs[Vector](0)

// to transform into a regular array
val array: Array[Double] = vect.toArray

Note also that you can access columns by name like this:

val vect: Vector = firstRow.getAs[Vector]("finalFeatures")
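For completeness, here is a minimal end-to-end sketch, assuming a SparkSession in scope as spark (as in spark-shell) and a DataFrame assembled with VectorAssembler as in the question; the column names are illustrative:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import spark.implicits._

// illustrative input: two numeric columns assembled into one vector column
val raw = Seq((3.0, 6.0), (1.0, 2.0)).toDF("a", "b")
val df = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("finalFeatures")
  .transform(raw)

// extract the first row's vector as a plain Array[Double]
val array: Array[Double] = df.head.getAs[Vector]("finalFeatures").toArray
// array: Array(3.0, 6.0)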
  • How do I convert from a wrapped array to a normal double array? – Leothorn Apr 08 '19 at 07:48
  • `.toArray`, I edited the answer to make it clearer. – Oli Apr 08 '19 at 08:06
  • value getAs is not a member of org.apache.spark.sql.DataFrame -> which import statement are you using? – Leothorn Apr 08 '19 at 08:08
  • It's a method of Row, not DataFrame. – Oli Apr 08 '19 at 08:08
  • I am getting an error: java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to scala.collection.Seq ... 54 elided – Leothorn Apr 08 '19 at 08:11
  • Error for this code: `import org.apache.spark.mllib.linalg.Vectors; result.select("finalFeatures").head(1)(0).getAs[Seq[Double]](0)` – Leothorn Apr 08 '19 at 08:12
  • Oh, I see. Can you give me the result of `dataframe.printSchema`? – Oli Apr 08 '19 at 08:13
  • root |-- finalFeatures: vector (nullable = true) – Leothorn Apr 08 '19 at 08:15
  • I edited my answer to take into account all this. It should work fine now ;) – Oli Apr 08 '19 at 08:26
  • java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector ... 76 elided ... sorry, I get this error... are there two errors here? – Leothorn Apr 08 '19 at 08:39
  • try `getAs[DenseVector]` then... but I'm surprised this does not work... It could be interesting that you add in your question more information about how you generated this dataframe. – Oli Apr 08 '19 at 08:53
  • I created this dataframe using VectorAssembler - the suggestion you made failed with a weird error - import org.apache.spark.mllib.linalg.Vector java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.DenseVector – Leothorn Apr 08 '19 at 09:35
  • It worked, Oli - `getAs[org.apache.spark.ml.linalg.Vector]`, use the full path – Leothorn Apr 08 '19 at 09:38
  • OK, now I know what's wrong. You're using Spark ML; I googled the doc and landed on the old Spark MLlib. I'll fix the answer again :) – Oli Apr 08 '19 at 09:39
  • `import org.apache.spark.ml.linalg.Vectors; var x = df.head(1)(0).getAs[org.apache.spark.ml.linalg.Vector](0).toArray; x.length` - purrfect – Leothorn Apr 08 '19 at 09:39
  • Great :) The answer is fixed! – Oli Apr 08 '19 at 09:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/191443/discussion-between-leothorn-and-oli). – Leothorn Apr 08 '19 at 10:47
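To summarize the root cause uncovered in the thread above: Spark has two distinct Vector types, and VectorAssembler produces spark.ml vectors, so casting them to the legacy spark.mllib type fails at runtime. A minimal sketch, reusing the df from the answer:

import org.apache.spark.ml.linalg.{Vector => MLVector}       // DataFrame-based API; what VectorAssembler produces
import org.apache.spark.mllib.linalg.{Vector => MLlibVector} // legacy RDD-based API

val arr = df.head.getAs[MLVector]("finalFeatures").toArray   // works
// df.head.getAs[MLlibVector]("finalFeatures")               // ClassCastException: ml.DenseVector cannot be cast to mllib.Vector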