
If I have an RDD of key/value pairs (the key being the column index), is it possible to load it into a DataFrame? For example:

(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)

And have the DataFrame look like:

1,2,18
1,10,18
2,20,18

2 Answers


Yes, it's possible (tested with Spark 1.3.1):

>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]
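
To also get the grouped layout shown in the question (one column per original key), a minimal sketch along the same lines could group the values by key and transpose them before building the DataFrame. This assumes the same sc and sqlContext from the shell above, that per-key ordering is not important, and the c0/c1/c3 column names are just placeholders:

# Sketch: collect each key's values into a column, then transpose into rows.
# groupByKey does not guarantee ordering inside a group, so this is illustrative.
rdd = sc.parallelize([(0, 1), (0, 1), (0, 2),
                      (1, 2), (1, 10), (1, 20),
                      (3, 18), (3, 18), (3, 18)])

columns = (rdd.groupByKey()        # (key, iterable of values)
              .mapValues(list)
              .sortByKey()
              .values()
              .collect())          # [[1, 1, 2], [2, 10, 20], [18, 18, 18]]

rows = [list(r) for r in zip(*columns)]   # transpose: one row per position

sqlContext.createDataFrame(rows, ["c0", "c1", "c3"]).show()

Note that collecting the per-key columns to the driver only makes sense for small data; for large RDDs a join or a DataFrame pivot would be the more usual approach.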
0
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = rdd.toDF(['id', 'score'])
df.show()

The output is:

+---+-----+
| id|score|
+---+-----+
|  0|    1|
|  0|    1|
|  0|    2|
|  1|    2|
|  1|   10|
|  1|   20|
|  3|   18|
|  3|   18|
|  3|   18|
+---+-----+