I have a dataframe that contains ~4bn records. Many of the columns are 64-bit ints, but could be truncated to 32-bit or 16-bit ints without data loss. When I try converting the data types using the following function:

from pyspark.sql.types import ShortType

def switchType(df, colName):
    # Cast the column to ShortType, then swap it back in under the original name.
    df = df.withColumn(colName + "SmallInt", df[colName].cast(ShortType()))
    df = df.drop(colName)
    return df.withColumnRenamed(colName + 'SmallInt', colName)

positionsDf = switchType(positionsDf, "FundId")
# repeat for 4 more cols...
print(positionsDf.cache().count())

This shows as taking 54.7 MB in RAM. When I don't do this, it shows as 56.7 MB in RAM.

So, is it worth trying to truncate ints at all?

I am using Spark 2.0.1 in standalone mode.
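For reference, this is roughly how the same casts could be applied across several columns in one pass (a sketch only; the column names other than FundId are placeholders for the ones I left out):

# Sketch: the same casts done in a single select instead of repeated withColumn calls.
# The extra entries in colsToShrink are placeholders.
from pyspark.sql.functions import col
from pyspark.sql.types import ShortType

colsToShrink = ["FundId"]  # plus the other columns (names omitted here)
positionsDf = positionsDf.select(
    *[col(c).cast(ShortType()).alias(c) if c in colsToShrink else col(c)
      for c in positionsDf.columns]
)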

ThatDataGuy

2 Answers


If you plan to write it in a format that stores numbers in binary (Parquet, Avro), it may save some space. For calculations, there will probably be no difference in speed.
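A quick way to check this is to write the same data with both schemas and compare the on-disk sizes. A minimal sketch (the output paths are placeholders, and FundId stands in for whichever columns you shrink):

# Sketch: compare on-disk Parquet size for 64-bit vs 16-bit columns.
from pyspark.sql.types import ShortType

df_long = positionsDf  # original 64-bit columns
df_short = positionsDf.withColumn("FundId", positionsDf["FundId"].cast(ShortType()))

df_long.write.mode("overwrite").parquet("/tmp/positions_long")
df_short.write.mode("overwrite").parquet("/tmp/positions_short")
# Then compare the two output directories, e.g. with `du -sh /tmp/positions_*`.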

Mariusz
  • Doesn't Spark take advantage of SSE and similar instructions? – George Sovetov Nov 15 '16 at 21:52
  • Spark uses only what the JVM can give. In the case of Java there is no real speed boost from changing numerical types: http://stackoverflow.com/questions/2380696/java-short-integer-long-performance – Mariusz Nov 16 '16 at 05:09

OK, for the benefit of anyone else who stumbles across this: if I understand it correctly, it depends on your JVM implementation (so it's machine/OS specific), but in my case it makes little difference. I'm running Java 1.8.0_102 on 64-bit RHEL 7.

I tried it with a larger dataframe (3tn+ records). The dataframe contains 7 columns of type short/long and 2 of type double:

  • As longs - 59.6 GB
  • As shorts - 57.1 GB

The tasks I used to create this cached dataframe also showed no real difference in execution time.

It is also nice to note that the storage size does seem to scale linearly with the number of records, so that is good.
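For anyone who wants to reproduce the comparison on a smaller scale, a minimal sketch might look like the following (the synthetic column names are made up; the actual cached sizes are read off the Storage tab of the Spark UI):

# Sketch: cache the same values as longs and as shorts, then compare
# the two entries in the Spark UI's Storage tab.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import ShortType

spark = SparkSession.builder.appName("cache-size-check").getOrCreate()

base = spark.range(10 * 1000 * 1000).select(
    col("id").alias("FundId"),                         # stays a 64-bit long
    (col("id") % 1000).cast(ShortType()).alias("FundIdShort"),
)

as_longs = base.select("FundId").cache()
as_shorts = base.select("FundIdShort").cache()
print(as_longs.count(), as_shorts.count())             # materialise both caches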

ThatDataGuy