
When running simple SQL commands in Databricks, sometimes I get the message:

Determining location of DBIO file fragments. This operation can take some time.

What does this mean, and how do I prevent it from having to perform this apparently-expensive operation every time? This happens even when all the underlying tables are Delta tables.

David Maddox

1 Answer


That message is about the Delta cache (now called the disk cache). Databricks is determining which executors have which file fragments cached, so it can route tasks for the best cache locality. Running `OPTIMIZE` on your tables more frequently, so there are fewer, larger files to track, will reduce this time.
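To illustrate the suggestion, here is a minimal sketch of compacting a Delta table; the table and column names are placeholders, not from the question:

```sql
-- Compact many small files into fewer large ones,
-- so the cache has fewer fragments to locate
OPTIMIZE my_schema.my_table;

-- Optionally co-locate data by a commonly filtered column
-- while compacting (ZORDER BY)
OPTIMIZE my_schema.my_table ZORDER BY (event_date);
```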

Joe Widen
    Also, using fewer machines will cut down on this time as well. – Joe Widen Aug 13 '20 at 19:18
  • When you say optimize the table, is that a Spark config or code that needs to be run on a schedule? I am using Delta caching and Runtime 10.3 in Azure, and I keep getting that message. We also have a vacuum job, but even after running it I still get the above message. Any help would be much appreciated. – Vitali Dedkov Feb 09 '22 at 16:31
  • @VitaliDedkov run `%sql OPTIMIZE [table name]` – David Maddox Apr 29 '22 at 13:32
  • Hi @DavidMaddox, I did in fact create a notebook that does that, and it helped. At the time (I am not sure what the issue was) I was not doing that, and then the issue went away. I am not sure whether the OPTIMIZE command alone would have solved it, but from what I read (and as others commented) it probably would have helped as well. – Vitali Dedkov May 03 '22 at 13:41
  • @VitaliDedkov This is an old thread but I've encountered the same problem and no solution found seemed to help. Do you mind sharing what worked for you? – user17101610 Jan 31 '23 at 19:19
  • Hi @user17101610, honestly this issue came and went a couple of times. What I did first was run the `VACUUM ... DRY RUN` command to see whether there were old files that needed to be vacuumed. After that I ran the `OPTIMIZE` command on the table, then `VACUUM` without the dry run. That made it go away sometimes, but I suppose a blunt solution to never see it again is to not use Delta-cache-enabled VMs in Databricks. Let me know if that is enough detail to help you. – Vitali Dedkov Feb 15 '23 at 17:27
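The sequence described in the last comment can be sketched as follows; the table name is a placeholder:

```sql
-- 1. Preview which stale files a VACUUM would delete, without deleting
VACUUM my_schema.my_table DRY RUN;

-- 2. Compact small files into larger ones
OPTIMIZE my_schema.my_table;

-- 3. Actually remove files no longer referenced by the table
--    (default retention threshold is 7 days)
VACUUM my_schema.my_table;
```

As an alternative to avoiding cache-enabled VM types, the disk cache can be turned off via the cluster's Spark configuration (`spark.databricks.io.cache.enabled false`), at the cost of losing cached reads entirely.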