
Is there a way to directly convert a Spark dataframe to a Dask dataframe?

I currently use Spark's .toPandas() method to convert it into a pandas dataframe and then into a Dask dataframe. I believe this is an inefficient operation that does not utilize Dask's distributed processing capabilities, since pandas will always be the bottleneck.
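For context, the conversion path described above looks roughly like this (a sketch only: it assumes an existing Spark dataframe named `sdf` and a chosen partition count, both illustrative; it needs a running Spark session, so it is not runnable standalone):

```python
import dask.dataframe as dd

# `sdf` is an existing pyspark.sql.DataFrame (illustrative name).
# .toPandas() collects the ENTIRE dataset onto the driver node --
# this single-machine step is the bottleneck mentioned above.
pdf = sdf.toPandas()

# Re-split the in-memory pandas dataframe into Dask partitions.
# npartitions=8 is an arbitrary example value.
ddf = dd.from_pandas(pdf, npartitions=8)
```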

vva
  • Some more information here, please: is this a single node setup (dask and spark), if not, do all dask workers have access to spark? – mdurant Jul 18 '18 at 21:37
  • @mdurant this is a full-scale Hadoop cluster. Dask is currently installed on the edge node of the cluster, and there is a plan to use the dask-yarn package in the near future. – vva Jul 18 '18 at 21:39
  • did you manage to convert spark df to dask df? – Coder Oct 13 '22 at 14:56

1 Answer


I may be able to get you an efficient answer involving calling pyspark from each dask worker, but first I should point out that saving to parquet and loading the result may be the quickest and easiest method you can use.
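The parquet round-trip might look like the following sketch (the path and dataframe names are illustrative; it assumes storage such as HDFS that both the Spark cluster and the Dask workers can read, and it requires a running Spark session, so it is not runnable standalone):

```python
import dask.dataframe as dd

# Spark side: write the dataframe out as parquet files on shared storage.
# `sdf` is an existing pyspark.sql.DataFrame; the path is an example.
sdf.write.parquet("hdfs:///tmp/interchange.parquet")

# Dask side: read the same parquet files directly into a Dask dataframe.
# No pandas round-trip on the driver; each Dask worker reads its own
# subset of the row groups.
ddf = dd.read_parquet("hdfs:///tmp/interchange.parquet")
```

Because parquet preserves the schema and is partitioned on disk, Dask can parallelize the read across workers instead of funneling everything through a single pandas dataframe.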

mdurant
  • I have a doubt: wouldn't that be circular logic? Using Spark to convert data to parquet and then using Dask. What if the data processed by Spark is already in parquet? Is there any way to ensure that Spark and Dask work together, like pandas and Spark do? – vva Jul 19 '18 at 00:20
    If the data is already in parquet, load it directly with dask? – mdurant Jul 19 '18 at 13:08