
After spending way too much time figuring out why I get the following error

pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

while trying to create a dataframe based on Rows and a Schema, I noticed the following:

With a Row inside my RDD called rddRows looking as follows:

Row(a="1", b="2", c=3)

and my dfSchema defined as:

dfSchema = StructType([
        StructField("c", IntegerType(), True),
        StructField("a", StringType(), True),
        StructField("b", StringType(), True)
        ])

creating a dataframe as follows:

df = sqlContext.createDataFrame(rddRows, dfSchema)

raises the above-mentioned error, because Spark only considers the order of the StructFields in the schema and does not match the names of the StructFields to the names of the Row fields.

In other words, in the above example, I noticed that Spark tries to create a dataframe that would look as follows (if there were no TypeError, e.g. if everything were of type String):

+---+---+---+
| c | b | a |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+

Is this really expected behavior, or is it some sort of bug?

EDIT: the rddRows are created along these lines:

def createRows(dic):
    res = Row(a=dic["a"], b=dic["b"], c=int(dic["c"]))
    return res

rddRows = rddDict.map(createRows)

where rddDict is a parsed JSON file.
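For reference, here is a stripped-down, self-contained version of what I am doing (a rough sketch, assuming a local SparkContext/SQLContext and PySpark 1.x imports; the sample dictionary is made up):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext("local[2]", "row-order-repro")
sqlContext = SQLContext(sc)

# stand-in for the parsed JSON RDD
rddDict = sc.parallelize([{"a": "1", "b": "2", "c": "3"}])

def createRows(dic):
    # Row sorts keyword arguments alphabetically, so the fields end up ordered (a, b, c)
    return Row(a=dic["a"], b=dic["b"], c=int(dic["c"]))

rddRows = rddDict.map(createRows)

dfSchema = StructType([
        StructField("c", IntegerType(), True),   # schema order: c, a, b
        StructField("a", StringType(), True),
        StructField("b", StringType(), True)
        ])

# Raises: TypeError: IntegerType can not accept object in type <type 'unicode'>
# because the first Row value ("1", a string) is matched by position to the
# first StructField ("c", IntegerType), not by name.
df = sqlContext.createDataFrame(rddRows, dfSchema)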

Kito
  • How are you creating your `rddRows`? – eliasah Nov 03 '15 at 14:01
  • The code is a bit too big for a comment, but I do it along these lines: def createRows(dic): res = Row(a=dic["a"], b=dic["b"], c=int(dic["c"])) return res rddRows = rddDict.map(createRows) where rddDict is a parsed JSON file. However, I also tried it with a different example but got the same results. – Kito Nov 03 '15 at 14:43
  • The type is: . I use it inside Spark Streaming, but I also observed the same issue in a very simple batch job. – Kito Nov 03 '15 at 14:56
  • Well, it looks like expected behavior. PySpark `Row`, similarly to its Scala counterpart, is simply a tuple. That means it has a fixed order and number of values. Everything else, like names or a schema (in the case of the Scala version), is just metadata. Since a row can have no names at all, or the names in the schema can differ from those in the rows, the only reasonable way to match them is by order. This is in contrast to, for example, a JSON source, where order is not meaningful and names are the only good way to match records. – zero323 Nov 03 '15 at 14:58
  • Hmmm ok. Thanks for the clarification. Maybe as a short follow-up: say I already have another dataframe with columns c, b, a to which I want to append the above created dataframe. What would be the best way to do that? I thought of the .unionAll function. However, in order to use it I would need the same column order for both dataframes, right? – Kito Nov 03 '15 at 15:03
  • That's right, this is how SQL union works. You can adjust the Scala code I've provided [here](http://stackoverflow.com/a/32705507/1560062), but I think that explicit ordering is much cleaner unless you have a very large number of columns. BTW, if you take JSON as an input, why not use `SqlContext.read.json`? – zero323 Nov 03 '15 at 15:35
  • Yes, I ended up using a simple list like [3, 1, 2] instead of a Row. That way I can influence the order of the columns of the dataframe. Thanks for the `SqlContext.read.json` suggestion, but I'm getting the JSON through a socket inside Spark Streaming. As far as I know, `SqlContext.read.json` can only be used to read from a file, right? – Kito Nov 03 '15 at 19:06
  • As far as I know, in PySpark, yes. In Scala it can be an `RDD[String]` as well. – zero323 Nov 03 '15 at 21:18

1 Answer


The `Row` constructor sorts the keys alphabetically if you provide keyword arguments. Take a look at the source code here. When I found out about that, I ended up sorting my schema fields accordingly before applying the schema to the dataframe:

# sort the schema fields by name so they line up with the alphabetically sorted Row fields
sorted_fields = sorted(dfSchema.fields, key=lambda x: x.name)
sorted_schema = StructType(fields=sorted_fields)
df = sqlContext.createDataFrame(rddRows, sorted_schema)
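
If you still want the columns in the order of your original schema afterwards, a `select` on the resulting dataframe should do it (a minimal sketch, assuming the same names as above):

# reorder the columns after creation; the sorted schema yields a, b, c
df = sqlContext.createDataFrame(rddRows, sorted_schema).select("c", "a", "b")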
architectonic