I've done a fair amount of searching and haven't come across any solid information on using the data.table package in the Databricks environment. Colleagues and I have run tests in Databricks using data.table's fread function to read a relatively large CSV (about 15 GB). fread takes a very long time there (we've never actually run it to completion), yet on our own laptops (16 GB RAM) it finishes in roughly 1-2 minutes.
In addition to the example above, I've read in a relatively small 34 MB CSV with both read.csv and fread. The run times are below:
- read.csv: 8 seconds
- fread: 25 seconds
As for cluster configuration, we're running fread on a single-node cluster with 32 cores and 128 GB of memory.
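In case it helps frame the question, this is the kind of diagnostic we've been trying. It's only a sketch with placeholder paths (the `/dbfs/...` path is hypothetical): one guess is that fread's multithreaded reads interact badly with FUSE-mounted storage like `/dbfs`, so we compare a single-threaded read against a read from the driver's local disk.

```r
# Diagnostic sketch only -- file paths are placeholders, not real data.
library(data.table)

# How many threads has data.table detected on this cluster?
getDTthreads(verbose = TRUE)

# Hypothesis: parallel reads over the FUSE mount (/dbfs) are the bottleneck.
# Time a single-threaded read directly from DBFS as a comparison point.
setDTthreads(1)
system.time(dt_dbfs <- fread("/dbfs/path/to/file.csv"))

# Copying to the driver's local disk first takes DBFS out of the picture.
file.copy("/dbfs/path/to/file.csv", "/tmp/file.csv")
setDTthreads(0)  # restore: let data.table use all available logical CPUs
system.time(dt_local <- fread("/tmp/file.csv"))
```

If the local-disk read is fast and the DBFS read is not, that would point at the storage layer rather than data.table itself, but we haven't been able to confirm this.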
Does anyone have suggestions for why data.table performs so poorly in the Databricks environment? I understand this isn't really the best use of Databricks and that we should switch to SparkR for performance, but our agency has many users who would benefit from being able to use the platform with their existing R code base without having to rewrite much of it.