
I have the following RDD:

rdd.take(5) gives me:

[DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
 DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
 DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]),
 DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
 DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699])]

I would like to make it a data frame which should look like:

-------------------------------------------------------------------
| features                                                        |
-------------------------------------------------------------------
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------| 
| [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]             |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|

Is this possible? I tried df_new = sqlContext.createDataFrame(rdd, ['features']), but it didn't work. Does anyone have any suggestions? Thanks!

zero323
Edamame

1 Answer


Map to tuples first:

rdd.map(lambda x: (x, )).toDF(["features"])
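The trailing comma is what makes `(x, )` a tuple. A minimal plain-Python sketch of the difference (the list `x` here is only a stand-in for one vector from the RDD, not something from the question):

```python
# Stand-in for a single vector; a plain list keeps this runnable without Spark.
x = [9.2463, 1.0, 0.392]

# Parentheses alone only group an expression: this is still the same list.
not_a_tuple = (x)

# The trailing comma creates a one-element tuple, i.e. a row with one field.
one_field_row = (x,)

print(type(not_a_tuple).__name__)
print(type(one_field_row).__name__, len(one_field_row))
```

Each one-element tuple is then treated as a row with a single column, which is the column that toDF names "features".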

Just keep in mind that as of Spark 2.0 there are two different Vector implementations (pyspark.mllib.linalg and pyspark.ml.linalg), and ml algorithms require the pyspark.ml.linalg one.

zero323
  • Thanks! map(lambda x: (x, )) looks very mysterious, would you please elaborate more? Thank you! – Edamame Sep 18 '16 at 14:25
  • `(x, )` is a single element `tuple`. Mapping is required because only [some objects can be converted to `Row`](http://stackoverflow.com/a/32742294/1560062) – zero323 Sep 18 '16 at 14:26