I have a dataframe that contains ~4bn records. Many of the columns are 64-bit ints, but could be truncated into 32-bit or 16-bit ints without data loss. When I try converting the data types using the following function:
from pyspark.sql.types import ShortType

def switchType(df, colName):
    # Cast the column to a 16-bit int, drop the original, and rename back.
    df = df.withColumn(colName + "SmallInt", df[colName].cast(ShortType()))
    df = df.drop(colName)
    return df.withColumnRenamed(colName + "SmallInt", colName)
positionsDf = switchType(positionsDf, "FundId")
# repeat for 4 more cols...
print(positionsDf.cache().count())
the cached dataframe shows as taking 54.7 MB in RAM. Without the casts, it shows as 56.7 MB, i.e. only about a 2 MB saving. So, is it worth trying to truncate ints at all?
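For reference, here is a quick back-of-the-envelope check of the raw storage difference, in plain Python rather than Spark (the `array` module packs values at a fixed width, roughly like a columnar store; Spark's reported cache size additionally reflects its own storage format, so the on-paper 4x saving is only a ceiling):

```python
from array import array

n = 1_000_000
as_int64 = array('q', [0] * n)  # 'q' = signed 8-byte int
as_int16 = array('h', [0] * n)  # 'h' = signed 2-byte int

print(as_int64.itemsize * n)  # 8000000 bytes, ~8 MB
print(as_int16.itemsize * n)  # 2000000 bytes, ~2 MB
```

On raw storage alone I would expect far more than a 2 MB difference across 5 columns of ~4bn rows, which is what makes the measured numbers surprising.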
I am using Spark 2.0.1 in standalone mode.