I'm trying to create a DataFrame from an RDD, and I want to specify the schema explicitly. Below is the code snippet I tried.
from pyspark.sql.types import StructField, StructType, LongType, StringType

stringJsonRdd_new = sc.parallelize((
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"}',
    '{"id": "234", "name": "Michael", "age": 22, "eyeColor": "green"}',
    '{"id": "345", "name": "Simone", "age": 23, "eyeColor": "blue"}'))

mySchema = StructType([
    StructField("id", LongType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True),
    StructField("name", StringType(), True)])

new_df = sqlContext.createDataFrame(stringJsonRdd_new, mySchema)
new_df.printSchema()
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- eyeColor: string (nullable = true)
|-- name: string (nullable = true)
But when I call new_df.show(), I get this error:
ValueError: Unexpected tuple '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown" }' with StructType
Can someone help me out?
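From reading the docs, my guess is that createDataFrame with an explicit StructType expects the RDD elements to already be tuples or Rows whose fields match the schema, not raw JSON strings. Something like the sketch below is what I imagine would be needed (the json.loads parsing and the int() casts are my own guess, not something I have verified):

import json

# Guess: parse each JSON string into a dict, then build tuples
# in the same field order as mySchema (id, age, eyeColor, name).
rowRdd = stringJsonRdd_new \
    .map(lambda s: json.loads(s)) \
    .map(lambda d: (int(d["id"]), int(d["age"]), d["eyeColor"], d["name"]))
new_df = sqlContext.createDataFrame(rowRdd, mySchema)
new_df.show()

Is that the right approach, or is there a way to keep the JSON strings and still apply mySchema directly?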
PS: I can explicitly typecast and create a new DataFrame from an existing DataFrame using:
casted_df = stringJsonDf.select(stringJsonDf.age, stringJsonDf.eyeColor, stringJsonDf.name, stringJsonDf.id.cast('int').alias('new_id'))
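Here stringJsonDf is the DataFrame with the schema Spark inferred from the JSON strings; it was built roughly like this (a minimal sketch of that setup):

# Let Spark infer the schema from the RDD of JSON strings;
# the inferred id column comes back as a string, hence the cast('int') above.
stringJsonDf = sqlContext.read.json(stringJsonRdd_new)

But I'd rather apply mySchema directly when creating the DataFrame instead of casting columns afterwards.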