I'm trying to create a DataFrame from an RDD, and I want to specify the schema explicitly. Below is the code I tried.

from pyspark.sql.types import StructField, StructType , LongType, StringType

stringJsonRdd_new = sc.parallelize(('{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"  }',\
'{ "id": "234","name": "Michael", "age": 22, "eyeColor": "green"  }',\
'{ "id": "345", "name": "Simone", "age": 23, "eyeColor": "blue" }'))

mySchema = StructType([StructField("id", LongType(), True), StructField("age", LongType(), True), StructField("eyeColor", StringType(), True), StructField("name", StringType(),True)])
new_df = sqlContext.createDataFrame(stringJsonRdd_new,mySchema)
new_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
 |-- name: string (nullable = true)

When I call new_df.show(), I get this error:

ValueError: Unexpected tuple '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"  }' with StructType

Can someone help me out?

PS: I could cast explicitly and create a new DataFrame from an existing one using:

casted_df = stringJsonDf.select(stringJsonDf.age,stringJsonDf.eyeColor, stringJsonDf.name,stringJsonDf.id.cast('int').alias('new_id'))
Alper t. Turker
Sumit

1 Answer

You are passing strings to createDataFrame instead of dictionaries, so Spark cannot map each record to the fields and types defined in your schema.

It works if you modify your code as below. Note that "id" in the data also has to be a number rather than a string; alternatively, change the type of "id" in the schema from LongType to StringType:

from pyspark.sql.types import StructField, StructType , LongType, StringType

# give dictionaries instead of strings:
stringJsonRdd_new = sc.parallelize((
{"id": 123, "name": "Katie", "age": 19, "eyeColor": "brown"  },\
{ "id": 234,"name": "Michael", "age": 22, "eyeColor": "green"  },\
{ "id": 345, "name": "Simone", "age": 23, "eyeColor": "blue" }))

mySchema = StructType([StructField("id", LongType(), True), StructField("age", LongType(), True), StructField("eyeColor", StringType(), True), StructField("name", StringType(),True)])

new_df = sqlContext.createDataFrame(stringJsonRdd_new,mySchema)
new_df.printSchema()
new_df.show()


root
 |-- id: long (nullable = true)
 |-- age: long (nullable = true)
 |-- eyeColor: string (nullable = true)
 |-- name: string (nullable = true)

+---+---+--------+-------+
| id|age|eyeColor|   name|
+---+---+--------+-------+
|123| 19|   brown|  Katie|
|234| 22|   green|Michael|
|345| 23|    blue| Simone|
+---+---+--------+-------+
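
If the input has to stay as JSON strings, one option is to parse each string before handing it to createDataFrame. The following is only a minimal sketch of that parsing step in plain Python, assuming the same four fields as above; the helper name to_row and the FIELDS tuple are hypothetical and not part of the original code.

```python
import json

# Field order must match the StructType above: id, age, eyeColor, name.
FIELDS = ("id", "age", "eyeColor", "name")

def to_row(line):
    """Parse one JSON string into a tuple ordered like the schema (hypothetical helper)."""
    record = json.loads(line)
    # The original data stores "id" as a string, but the schema declares
    # LongType, so convert it to an int here.
    record["id"] = int(record["id"])
    return tuple(record[name] for name in FIELDS)

lines = [
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"  }',
    '{ "id": "234","name": "Michael", "age": 22, "eyeColor": "green"  }',
    '{ "id": "345", "name": "Simone", "age": 23, "eyeColor": "blue" }',
]
rows = [to_row(line) for line in lines]

# With Spark available, the same helper would be mapped over the RDD, e.g.:
#   new_df = sqlContext.createDataFrame(stringJsonRdd_new.map(to_row), mySchema)
```

Returning tuples in schema order avoids relying on how dictionaries are matched to field names, and the int conversion handles the "id" values that arrive as strings.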

Hope this helps, good luck!

mkaran