If I have an RDD of Key/Value (key being the column index) is it possible to load it into a dataframe? For example:
(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)
And have the dataframe look like:
1,2,18
1,10,18
2,20,18
Yes, it's possible (tested with Spark 1.3.1):
>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]
Or, equivalently, with toDF:

rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = rdd.toDF(['id', 'score'])
df.show()
The output is:
+---+-----+
| id|score|
+---+-----+
| 0| 1|
| 0| 1|
| 0| 2|
| 1| 2|
| 1| 10|
| 1| 20|
| 3| 18|
| 3| 18|
| 3| 18|
+---+-----+
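Note that this loads the pairs as two columns (id, score) rather than producing the pivoted layout from the question, where values sharing a key become one column and the i-th values of each key form row i. That takes an extra grouping step; as a minimal sketch, here is the reshaping logic in plain Python (in Spark the same idea could be expressed with groupByKey followed by zipping the per-key lists, assuming each key has the same number of values):

```python
from collections import defaultdict

pairs = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20),
         (3, 18), (3, 18), (3, 18)]

# Collect the values for each key: 0 -> [1, 1, 2], 1 -> [2, 10, 20], 3 -> [18, 18, 18]
columns = defaultdict(list)
for key, value in pairs:
    columns[key].append(value)

# Zip the per-key value lists together so the i-th value of each key forms row i.
rows = list(zip(*(columns[k] for k in sorted(columns))))
print(rows)  # [(1, 2, 18), (1, 10, 18), (2, 20, 18)]
```

This assumes every key contributes the same number of values; with ragged keys you would need to decide how to pad the shorter columns before zipping.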