
I've done a fair amount of searching and haven't come across any solid information regarding the use of the data.table package in the Databricks environment. My colleagues and I have run tests in Databricks using data.table's fread function to read in a relatively large CSV (about 15 GB). In Databricks, fread takes so long that we've never actually let it run to completion, but on our own laptops (16 GB RAM) the same read takes roughly 1-2 minutes.

In addition to the example above, I've read in a relatively small 34 MB CSV with both read.csv and fread. The run times are below, with a minimal reproducible sketch after the list:

  • read.csv: 8 seconds
  • fread: 25 seconds
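
For reference, a minimal sketch of the comparison (the file path here is a placeholder, not our actual file, and the timings will of course vary by environment):

```r
library(data.table)

# Placeholder path to the 34 MB CSV; substitute your own file location.
path <- "/dbfs/tmp/small_file.csv"

system.time(read.csv(path))           # base R reader: ~8 seconds in our runs
system.time(data.table::fread(path))  # data.table reader: ~25 seconds in our runs
```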

As for cluster configuration, we're running the fread function on a single-node cluster with 32 cores and 128 GB of memory.

Does anyone have suggestions for why data.table performs so poorly in the Databricks environment? I understand that this isn't really the best use of Databricks and that we should switch to SparkR for performance purposes, but our agency has many users who would benefit from being able to leverage the platform with their existing R code base without having to tweak it too much.

Foxhound013
  • (1) Your benchmarking of 8 vs 25 seconds is counter to all of my experience with `fread`; it would be well served by something reproducible. If it's being used correctly, then it's likely a bug. (2) The statements *"a very long time"* and *"1-2 minutes"* seem at odds for a 15 GB file, and both are completely confounded by *"never run it to completion"*. While I don't doubt that there are challenges here, I suggest this question needs concrete examples to back up its claims. – r2evans Jan 11 '22 at 16:30
  • I'll work on clarifying the question some, thanks for the comment. I'll respond here but update the question when I have a little more time. 1-2 minutes is on a local machine, and the very-long-time comment is in Databricks. I agree that it's counter to my own experience as well, and I suspect that it's a Databricks-specific problem. – Foxhound013 Jan 11 '22 at 18:08

1 Answer


I realize that this is an old question, but I just came across it, and I once struggled with the same problem. I was never able to find any support on this from the Databricks side, but I found that the bottleneck was related to how Databricks moves data from the file store (S3 in my case). I ended up writing a wrapper function that uses botor (or boto3 via a system call) to copy the file from S3 to /tmp on the driver machine, and then calls fread from there. Doing it this way resulted in fread times comparable to what you see locally. A sketch of that wrapper is below.
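
A minimal sketch of the wrapper, assuming the botor package is installed and AWS credentials are available on the driver; the function name and URI are illustrative, not my original code:

```r
library(botor)       # R client for S3, wrapping boto3 via reticulate
library(data.table)

# Hypothetical wrapper: copy an S3 object to local driver storage
# (tempdir() typically resolves to a directory under /tmp), then read
# it with fread from the local filesystem, avoiding the slow
# file-store access path.
fread_s3 <- function(s3_uri, ...) {
  local_path <- file.path(tempdir(), basename(s3_uri))
  s3_download_file(s3_uri, local_path)  # download s3://... to the driver
  on.exit(unlink(local_path))           # clean up the temporary copy
  fread(local_path, ...)
}

# Illustrative usage:
# dt <- fread_s3("s3://my-bucket/path/to/large_file.csv")
```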

Jordan C