
What does toPandas() actually do when using the Arrow optimization?

Is the resulting pandas DataFrame safe for wide transformations (those that require data shuffling), e.g. merge operations? What about group-by and aggregate? What kind of performance limitations should I expect?

I am trying to standardize on pandas DataFrames where possible, due to ease of unit testing and swappability with in-memory objects, without starting the monstrous Spark instance.

Alwyn
  • it seems the answer to this is likely no – it doesn't work with wide transformations. It somehow works when the operation stays within a partition, but does not work well with really large merges – Alwyn Aug 30 '19 at 07:37

1 Answer


toPandas() takes your Spark DataFrame and pulls all partitions onto the client driver machine as a single pandas DataFrame. Any operations on this new pandas DataFrame run on that one machine in plain Python, so no distributed wide transformations are possible: you are no longer using the Spark cluster's distributed computing (i.e. there is no partition or worker-node interaction anymore).
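A minimal sketch of what this means in practice. The Spark lines are shown only as comments (the `spark` session, `spark_df`, and all column names are hypothetical); Arrow only speeds up the transfer, the result is still an ordinary pandas DataFrame in driver memory:

```python
import pandas as pd

# In real code the DataFrame would come from Spark, e.g. (Spark 2.x config name):
#   spark.conf.set("spark.sql.execution.arrow.enabled", "true")
#   pdf = spark_df.toPandas()
# Here we build the equivalent pandas DataFrame directly.
pdf = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
dims = pd.DataFrame({"id": [1, 2], "region": ["east", "west"]})

# These "wide" operations do work, but they run single-threaded
# on the driver machine, limited by its memory and CPU:
merged = pdf.merge(dims, on="id", how="inner")
totals = merged.groupby("region")["amount"].sum()
```

So merges and group-aggregates are safe in the sense of being correct, but they are bounded by a single machine's resources, not by the cluster's.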

thePurplePython
  • When reading the documentation for Spark Arrow, however, it's supposed to allow pandas computation to be pushed to executors. Is this false? – Alwyn Aug 31 '19 at 18:50
  • yes - official docs "Using the above optimizations with Arrow will produce the same results as when Arrow is not enabled. Note that even with Arrow, toPandas() results in the collection of all records in the DataFrame to the driver program and should be done on a small subset of the data" - https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html – thePurplePython Aug 31 '19 at 21:09
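A hedged note on the comment above: what does push pandas computation out to the executors is the pandas-UDF / grouped-map API, not toPandas(). Each executor receives one group as a pandas DataFrame, which also means the per-group function is plain pandas and can be unit-tested without a Spark session, matching the goal in the question. A sketch under that assumption (function, schema, and columns are hypothetical; the grouped-map call shown is the Spark 3.x form):

```python
import pandas as pd

def top_amount_per_region(pdf: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas: this is what each executor would run on one group.
    return pdf.loc[[pdf["amount"].idxmax()], ["region", "amount"]]

# On a cluster this would be applied per group, e.g.:
#   spark_df.groupBy("region").applyInPandas(
#       top_amount_per_region, schema="region string, amount double")
#
# Locally, the same function is unit-testable with pure pandas:
sample = pd.DataFrame({"region": ["east", "east"], "amount": [10.0, 25.0]})
result = top_amount_per_region(sample)
```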