I have the following RDD of Rows. As can be seen, every field is a string:

[Row(A='6', B='1', C='hi'),
 Row(A='4', B='5', C='bye'),
 Row(A='8', B='9', C='night')]

I want to convert this RDD into a DataFrame with IntegerType for columns A and B:

dtypes = [
    StructField('A', IntegerType(), True),
    StructField('B', IntegerType(), True),
    StructField('C', StringType(), True)
]

df = spark.createDataFrame(rdd, StructType(dtypes))

I get the following error:

TypeError: field A: IntegerType can not accept object '6' in type <class 'str'>

How can I successfully convert '6' into an IntegerType?

vi_ral
  • Possible duplicate: https://stackoverflow.com/questions/46956026/how-to-convert-column-with-string-type-to-int-form-in-pyspark-data-frame – Mahendra Singh Meena May 08 '20 at 17:45
  • I saw that post. It is dealing directly with converting column types in a spark DF, not converting column types when creating dataframe from RDD – vi_ral May 08 '20 at 17:49
  • Okay, for that you need to modify your RDD of Rows so that the string data is cast to integers before you create the DataFrame. – Mahendra Singh Meena May 08 '20 at 17:51

1 Answer

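You should convert the values in the RDD of Rows before you create the DataFrame, because createDataFrame checks each value against the schema rather than casting it to the declared type.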

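from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def modify_row(row):
    # Iterating over a Row yields its values, not its field names,
    # so convert it to a dict to get (name, value) pairs.
    new_row = {}
    for key, value in row.asDict().items():
        if key in ['A', 'B']:
            new_row[key] = int(value)
        else:
            new_row[key] = value
    return new_row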

rdd = (sc.parallelize([Row(A='6', B='1', C='hi'),
                       Row(A='4', B='5', C='bye'),
                       Row(A='8', B='9', C='night')])
         .map(modify_row))

dtypes = [
    StructField('A', IntegerType(), True),
    StructField('B', IntegerType(), True),
    StructField('C', StringType(), True)
]

df = spark.createDataFrame(rdd, StructType(dtypes))
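
The per-row conversion can also be written without hard-coding the column names. The following is a minimal sketch in plain Python, not part of PySpark: `cast_fields` and `casters` are hypothetical names introduced here, and plain dicts stand in for the output of `Row.asDict()` so the logic can be checked without a Spark session. Leaving `None` values untouched is an assumption made so that nullable fields do not raise.

```python
# Hypothetical helper (not a PySpark API): cast selected fields of one record.
def cast_fields(record, casters):
    """Return a new dict with casters[name] applied to each listed field.

    Fields without a caster, and None values, are copied unchanged.
    """
    return {
        name: casters[name](value) if name in casters and value is not None else value
        for name, value in record.items()
    }

# Plain dicts stand in for Row.asDict() output here.
records = [
    {"A": "6", "B": "1", "C": "hi"},
    {"A": "4", "B": "5", "C": "bye"},
    {"A": "8", "B": "9", "C": "night"},
]

# Map each column that needs converting to the Python type it should become.
casters = {"A": int, "B": int}

converted = [cast_fields(r, casters) for r in records]
```

With an RDD of Rows, the same helper would presumably be applied as rdd.map(lambda row: cast_fields(row.asDict(), casters)) before calling spark.createDataFrame with the schema.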

  • I like this solution. I was thinking the same. It kind of sucks that PySpark doesn't convert those columns to the desired type for you ... Will mark your solution as correct once I get this working. Thanks – vi_ral May 08 '20 at 17:58