
I'm trying to create a dataframe using what seems to be the canonical "hello world" of creating Spark dataframes, and I cannot fathom why it's failing. Help!

```python
from pyspark.sql.types import *

schema = StructType([StructField("product", StringType(), True)])
l = [('foo')]
rdd = sc.parallelize(l)
df = sqlContext.createDataFrame(rdd, schema)
df.show()
```

The above code throws this error:

```
ValueError: Unexpected tuple 'foo' with StructType
```

The code is basically lifted straight out of the pyspark.sql module documentation, so I am completely stumped.

jamiet

  • Possible duplicate of [Create Spark DataFrame. Can not infer schema for type: ](http://stackoverflow.com/questions/32742004/create-spark-dataframe-can-not-infer-schema-for-type-type-float) – Dec 09 '16 at 14:29

1 Answer


That's because `createDataFrame` with a `StructType` schema expects an `RDD[Row]` as its argument, but your RDD contains plain strings. Wrapping each element in a `Row` fixes it:

```python
df = sqlContext.createDataFrame(rdd.map(lambda x: Row(x)), schema)
```

This will give you the correct DataFrame.

Full code, tested on Spark 1.6:

```python
from pyspark.sql import Row  # needed for the map step below
from pyspark.sql.types import *

schema = StructType([StructField("product", StringType(), True)])
l = [('foo')]
rdd = sc.parallelize(l)

df = sqlContext.createDataFrame(rdd.map(lambda x: Row(x)), schema)
df.show()
```
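As an aside, part of the confusion here may be that `('foo')` is not a tuple at all; in Python, parentheses alone don't create a tuple, the comma does. A quick plain-Python check (no Spark needed) illustrates this:

```python
# Parentheses without a comma are just grouping; the comma makes the tuple.
s = ('foo')   # this is simply the string 'foo'
t = ('foo',)  # this is a one-element tuple

print(type(s).__name__)  # str
print(type(t).__name__)  # tuple
```

So if the original list had been `[('foo',)]`, each RDD element would be a real one-element tuple that Spark can map onto the one-field schema, which should also avoid the error without the `Row` wrapper.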
T. Gawęda
  • cool. I'm running in Jupyter so had to add `from pyspark.sql import Row` to make it work but that's perfect, thank you. – jamiet Dec 09 '16 at 13:51