After spending way too much time figuring out why I get the following error
pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>
while trying to create a dataframe based on Rows and a Schema, I noticed the following:
With a Row inside my RDD called rddRows looking as follows:
Row(a="1", b="2", c=3)
and my dfSchema defined as:
dfSchema = StructType([
    StructField("c", IntegerType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True)
])
creating a dataframe as follows:
df = sqlContext.createDataFrame(rddRows, dfSchema)
raises the above-mentioned error, because Spark only considers the order of the StructFields in the schema and does not match the names of the StructFields against the names of the Row fields.
In other words, in the above example, I noticed that Spark tries to create a DataFrame that would look as follows (if there were no TypeError, e.g. if everything were of type String):
+---+---+---+
| c | b | a |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
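For reference, the Row's field order can be inspected directly; a minimal sketch (assuming the same Python 2 / pre-3.0 Spark setup implied by the error message, where Row sorts keyword arguments alphabetically) shows the order that gets matched positionally against the schema:

from pyspark.sql import Row

# Row built from keyword arguments orders its fields by name,
# so the field order here is (a, b, c), while the schema is (c, a, b).
r = Row(a="1", b="2", c=3)
print(r.__fields__)   # ['a', 'b', 'c']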
Is this really expected, or is it some sort of bug?
EDIT: the rddRows are created along these lines:
def createRows(dic):
    res = Row(a=dic["a"], b=dic["b"], c=int(dic["c"]))
    return res
rddRows = rddDict.map(createRows)
where rddDict is a parsed JSON file.
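For comparison, a hypothetical workaround sketch (not the code I originally ran) would be to build plain tuples in the schema's declared order instead of relying on the Row's keyword ordering, so the positional match lines up:

def createTuples(dic):
    # Build the tuple in the same order as dfSchema: c, a, b
    return (int(dic["c"]), dic["a"], dic["b"])

rddTuples = rddDict.map(createTuples)
df = sqlContext.createDataFrame(rddTuples, dfSchema)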