
Is there a way to directly convert a Spark dataframe to a Dask dataframe?

I currently use Spark's .toPandas() method to convert it into a pandas dataframe and then into a Dask dataframe. I believe this is an inefficient operation that does not utilize Dask's distributed processing capabilities, since pandas will always be the bottleneck.
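For context, the conversion path described above looks roughly like this (a sketch only: it assumes an existing Spark dataframe named `sdf` and a chosen partition count, both illustrative; it needs a running Spark session, so it is not runnable standalone):

```python
import dask.dataframe as dd

# `sdf` is an existing pyspark.sql.DataFrame (illustrative name).
# .toPandas() collects the ENTIRE dataset onto the driver node --
# this single-machine step is the bottleneck mentioned above.
pdf = sdf.toPandas()

# Re-split the in-memory pandas dataframe into Dask partitions.
# npartitions=8 is an arbitrary example value.
ddf = dd.from_pandas(pdf, npartitions=8)
```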

vva
  • Some more information here, please: is this a single node setup (dask and spark), if not, do all dask workers have access to spark? – mdurant Jul 18 '18 at 21:37
  • @mdurant this is a full-scale Hadoop cluster. Dask is currently installed on the edge node of the cluster, and there is a plan to use the dask-yarn package in the near future. – vva Jul 18 '18 at 21:39
  • did you manage to convert spark df to dask df? – Coder Oct 13 '22 at 14:56

1 Answer


I may be able to get you an efficient answer involving calling pyspark from each dask worker, but first I should point out that saving to parquet and loading the result may be the quickest and easiest method you can use.
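The parquet round-trip might look like the following sketch (the path and dataframe names are illustrative; it assumes storage such as HDFS that both the Spark cluster and the Dask workers can read, and it requires a running Spark session, so it is not runnable standalone):

```python
import dask.dataframe as dd

# Spark side: write the dataframe out as parquet files on shared storage.
# `sdf` is an existing pyspark.sql.DataFrame; the path is an example.
sdf.write.parquet("hdfs:///tmp/interchange.parquet")

# Dask side: read the same parquet files directly into a Dask dataframe.
# No pandas round-trip on the driver; each Dask worker reads its own
# subset of the row groups.
ddf = dd.read_parquet("hdfs:///tmp/interchange.parquet")
```

Because parquet preserves the schema and is partitioned on disk, Dask can parallelize the read across workers instead of funneling everything through a single pandas dataframe.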

mdurant
  • I have a doubt: wouldn't that be circular logic? Using Spark to convert data to parquet and then using Dask. What if the data processed by Spark is already in parquet? Is there any way to ensure that Spark and Dask work together, like pandas and Spark do? – vva Jul 19 '18 at 00:20
    If the data is already in parquet, load it directly with dask? – mdurant Jul 19 '18 at 13:08