I have a dataframe that contains ~4bn records. Many of the columns are 64-bit ints, but could be truncated to 32-bit or 16-bit ints without data loss. When I try converting the data types using the following function:

from pyspark.sql.types import ShortType

def switchType(df, colName):
    # Cast the column to ShortType, then swap it back in under the original name.
    df = df.withColumn(colName + "SmallInt", df[colName].cast(ShortType()))
    df = df.drop(colName)
    return df.withColumnRenamed(colName + 'SmallInt', colName)

positionsDf = switchType(positionsDf, "FundId")
# repeat for 4 more cols...
print(positionsDf.cache().count())

This shows as taking 54.7 MB in RAM. When I don't do this, it shows as 56.7 MB in RAM.

So, is it worth trying to truncate ints at all?

I am using Spark 2.0.1 in standalone mode.
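For reference, this is roughly how the same casts could be applied across several columns in one pass (a sketch only; the column names other than FundId are placeholders for the ones I left out):

# Sketch: the same casts done in a single select instead of repeated withColumn calls.
# The extra entries in colsToShrink are placeholders.
from pyspark.sql.functions import col
from pyspark.sql.types import ShortType

colsToShrink = ["FundId"]  # plus the other columns (names omitted here)
positionsDf = positionsDf.select(
    *[col(c).cast(ShortType()).alias(c) if c in colsToShrink else col(c)
      for c in positionsDf.columns]
)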

ThatDataGuy

2 Answers


If you plan to write it in a format that stores numbers in binary (Parquet, Avro), it may save some space. For calculations, there will probably be no difference in speed.
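A quick way to check this is to write the same data with both schemas and compare the on-disk sizes. A minimal sketch (the output paths are placeholders, and FundId stands in for whichever columns you shrink):

# Sketch: compare on-disk Parquet size for 64-bit vs 16-bit columns.
from pyspark.sql.types import ShortType

df_long = positionsDf  # original 64-bit columns
df_short = positionsDf.withColumn("FundId", positionsDf["FundId"].cast(ShortType()))

df_long.write.mode("overwrite").parquet("/tmp/positions_long")
df_short.write.mode("overwrite").parquet("/tmp/positions_short")
# Then compare the two output directories, e.g. with `du -sh /tmp/positions_*`.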

Mariusz
  • Doesn't Spark take advantage of SSE and similar instructions? – George Sovetov Nov 15 '16 at 21:52
  • Spark uses only what the JVM can give. In the case of Java there is no real speed boost from changing numerical types: http://stackoverflow.com/questions/2380696/java-short-integer-long-performance – Mariusz Nov 16 '16 at 05:09

OK, for the benefit of anyone else who stumbles across this: if I understand it correctly, it depends on your JVM implementation (so it's machine/OS specific), but in my case it makes little difference. I'm running Java 1.8.0_102 on 64-bit RHEL 7.

I tried it with a larger dataframe (3tn+ records). The dataframe contains 7 columns of type short/long and 2 of type double:

  • As longs - 59.6 GB
  • As shorts - 57.1 GB

The tasks I used to create this cached dataframe also showed no real difference in execution time.

It is also nice to note that the storage size does seem to scale linearly with the number of records, so that is good.
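For anyone who wants to reproduce the comparison on a smaller scale, a minimal sketch might look like the following (the synthetic column names are made up; the actual cached sizes are read off the Storage tab of the Spark UI):

# Sketch: cache the same values as longs and as shorts, then compare
# the two entries in the Spark UI's Storage tab.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import ShortType

spark = SparkSession.builder.appName("cache-size-check").getOrCreate()

base = spark.range(10 * 1000 * 1000).select(
    col("id").alias("FundId"),                         # stays a 64-bit long
    (col("id") % 1000).cast(ShortType()).alias("FundIdShort"),
)

as_longs = base.select("FundId").cache()
as_shorts = base.select("FundIdShort").cache()
print(as_longs.count(), as_shorts.count())             # materialise both caches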

ThatDataGuy