I want to use Dask on Databricks. It should be possible (I cannot see why not). When I import it, one of two things happens: either I get an ImportError, or, once I install distributed to fix that, Databricks just says Cancelled without throwing any errors.

2 Answers
Anyone looking for an answer: check this medium blogpost. To prevent people from missing it in the comments, I'm posting this as an answer.

I don't think we have heard of anyone using Dask under Databricks, but so long as it's just Python, it may well be possible.
The default scheduler for Dask is threads, and this is the most likely thing to work. In this case you don't even need to install distributed.
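To make that concrete, here is a minimal sketch of a computation that uses only core dask with the threaded scheduler, so no distributed package and no extra processes are needed (the `square` function is just an illustrative placeholder, not from the answer):

```python
import dask


@dask.delayed
def square(x):
    # Placeholder work; each call becomes a task in the graph
    return x * x


# Build a small task graph and run it entirely in threads
# of the current process.
total = dask.delayed(sum)([square(i) for i in range(4)]).compute(
    scheduler="threads"
)
print(total)  # 0 + 1 + 4 + 9 = 14
```

Because everything runs inside the driver process's threads, this mode avoids the process-spawning restrictions that seem to be the problem here.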
For the Cancelled error, it sounds like you are using distributed, and, at a guess, the system is not allowing you to start extra processes (you could test this with the subprocess module). To work around it, you could do

import dask.distributed
client = dask.distributed.Client(processes=False)
Of course, if it is indeed the processes that you need, this would not be great. Also, I have no idea how you might expose the dashboard's port.
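The subprocess check suggested above could be sketched like this (a quick diagnostic, not from the answer itself): if this fails, the environment is blocking new processes, which would explain why distributed's default multi-process workers get cancelled.

```python
import subprocess
import sys

# Try to spawn one extra Python process; if the platform forbids
# process creation, this raises or returns a nonzero exit code.
result = subprocess.run(
    [sys.executable, "-c", "print('spawned ok')"],
    capture_output=True,
    text=True,
    timeout=30,
)
print(result.stdout.strip())
```

If the child process prints successfully, process creation is allowed and the Cancelled error likely has another cause.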

- This sadly still didn't work. However, this is starting to appear as a genuine limitation of Databricks itself, which is sad because I actually think Dask is the future of distributed computing in Python. – SARose Jun 06 '19 at 15:01
- Don't tell Databricks that! :) – mdurant Jun 06 '19 at 15:36
- Hi SARose - I'm curious as to WHY you want to use Dask on Databricks? i.e. what's the driver here? – Rodney Jul 31 '19 at 04:40
- 1. For data folks (RS, DS, MLEs), Spark errors and the interplay between Spark <=> ML libraries are substandard at best. 2. Even with Koalas (pandas_on_spark), the underlying code is Scala/Spark, with very opaque tasks that the actual user usually has no transparency into for debugging. – anakin Apr 18 '22 at 18:09