
I'm trying to create a dataframe using what seems to be the canonical "hello world" of creating Spark dataframes, and I cannot fathom why it's failing. Help!

```python
from pyspark.sql.types import *

schema = StructType([StructField("product", StringType(), True)])
l = [('foo')]
rdd = sc.parallelize(l)
df = sqlContext.createDataFrame(rdd, schema)
df.show()
```

The above code throws this error:

```
ValueError: Unexpected tuple 'foo' with StructType
```

The code is basically lifted straight out of the pyspark.sql module documentation, so I am completely stumped.

jamiet

  • Possible duplicate of [Create Spark DataFrame. Can not infer schema for type: ](http://stackoverflow.com/questions/32742004/create-spark-dataframe-can-not-infer-schema-for-type-type-float) – Dec 09 '16 at 14:29

1 Answer


That's because `createDataFrame` with a `StructType` schema expects an `RDD[Row]` as its argument, but your RDD contains plain strings. Wrapping each element in a `Row` fixes it:

```python
df = sqlContext.createDataFrame(rdd.map(lambda x: Row(x)), schema)
```

This will give you the correct DataFrame.

Full code, tested on Spark 1.6:

```python
from pyspark.sql import Row  # needed for the map step below
from pyspark.sql.types import *

schema = StructType([StructField("product", StringType(), True)])
l = [('foo')]
rdd = sc.parallelize(l)

df = sqlContext.createDataFrame(rdd.map(lambda x: Row(x)), schema)
df.show()
```
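As an aside, part of the confusion here may be that `('foo')` is not a tuple at all; in Python, parentheses alone don't create a tuple, the comma does. A quick plain-Python check (no Spark needed) illustrates this:

```python
# Parentheses without a comma are just grouping; the comma makes the tuple.
s = ('foo')   # this is simply the string 'foo'
t = ('foo',)  # this is a one-element tuple

print(type(s).__name__)  # str
print(type(t).__name__)  # tuple
```

So if the original list had been `[('foo',)]`, each RDD element would be a real one-element tuple that Spark can map onto the one-field schema, which should also avoid the error without the `Row` wrapper.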
T. Gawęda
  • cool. I'm running in Jupyter so had to add `from pyspark.sql import Row` to make it work but that's perfect, thank you. – jamiet Dec 09 '16 at 13:51