I'm trying to create a DataFrame from an RDD, and I want to specify the schema explicitly. Below is the code snippet I tried.
from pyspark.sql.types import StructField, StructType, LongType, StringType

stringJsonRdd_new = sc.parallelize((
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"}',
    '{"id": "234", "name": "Michael", "age": 22, "eyeColor": "green"}',
    '{"id": "345", "name": "Simone", "age": 23, "eyeColor": "blue"}'))

mySchema = StructType([
    StructField("id", LongType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True),
    StructField("name", StringType(), True)])

new_df = sqlContext.createDataFrame(stringJsonRdd_new, mySchema)
new_df.printSchema()
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- eyeColor: string (nullable = true)
|-- name: string (nullable = true)
But when I call new_df.show(), I get this error:
ValueError: Unexpected tuple '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown" }' with StructType
Can someone help me out?
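From reading the docs, my guess is that createDataFrame with an explicit StructType expects the RDD elements to already be tuples or Rows whose fields match the schema, not raw JSON strings. Something like the sketch below is what I imagine would be needed (the json.loads parsing and the int() casts are my own guess, not something I have verified):

import json

# Guess: parse each JSON string into a dict, then build tuples
# in the same field order as mySchema (id, age, eyeColor, name).
rowRdd = stringJsonRdd_new \
    .map(lambda s: json.loads(s)) \
    .map(lambda d: (int(d["id"]), int(d["age"]), d["eyeColor"], d["name"]))
new_df = sqlContext.createDataFrame(rowRdd, mySchema)
new_df.show()

Is that the right approach, or is there a way to keep the JSON strings and still apply mySchema directly?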
PS: I can explicitly typecast and create a new DataFrame from an existing DataFrame using:
casted_df = stringJsonDf.select(stringJsonDf.age, stringJsonDf.eyeColor, stringJsonDf.name, stringJsonDf.id.cast('int').alias('new_id'))
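Here stringJsonDf is the DataFrame with the schema Spark inferred from the JSON strings; it was built roughly like this (a minimal sketch of that setup):

# Let Spark infer the schema from the RDD of JSON strings;
# the inferred id column comes back as a string, hence the cast('int') above.
stringJsonDf = sqlContext.read.json(stringJsonRdd_new)

But I'd rather apply mySchema directly when creating the DataFrame instead of casting columns afterwards.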