
I have time-series data that I need to interpolate. There are several devices that send data for the various submodules connected to them. I now want to interpolate the data per device and per submodule onto a common time vector.

My implementation is currently as follows:

  • I have the time-series data as a Spark dataframe
  • I group the dataframe by device and submodule
  • I apply a pandas UDF that takes care of the interpolation (a simplified sketch follows this list)
  • Within the pandas UDF I also sort the data by the time column so the interpolation is done in the correct order
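
For reference, the interpolation UDF looks roughly like this. This is only a simplified sketch: the output schema, the column names time and value, the one-minute target grid, and the linear time interpolation are placeholders for what my real code does.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Simplified output schema; the real one has more value columns
result_schema = 'device string, submodule string, time timestamp, value double'

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def interpolate_udf(pdf):
    # Sort the group by time so the interpolation sees the samples in order
    pdf = pdf.sort_values('time')

    # Common time vector for this group (placeholder: one sample per minute)
    common_time = pd.date_range(pdf['time'].min(), pdf['time'].max(), freq='1min')

    # Interpolate the measured values onto the common time vector
    # (assumes unique timestamps within one device/submodule group)
    series = pdf.set_index('time')['value']
    series = (
        series.reindex(series.index.union(common_time))
              .interpolate(method='time')
              .reindex(common_time)
    )

    return pd.DataFrame({
        'device': pdf['device'].iloc[0],
        'submodule': pdf['submodule'].iloc[0],
        'time': common_time,
        'value': series.values,
    })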

The table from which I read the time-series data is already partitioned by device.
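
By "partitioned by device" I mean the table was written along these lines (a simplified sketch, not my actual ingestion code; raw_df is just a placeholder name):

# hypothetical ingestion dataframe; the table files are partitioned by device
raw_df.write.partitionBy('device').mode('append').saveAsTable(time_series_table)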

So the code is as follows:

# read the partitioned time-series table
df = spark.read.table(time_series_table)
# grouped-map pandas UDF, one call per (device, submodule) group
df_interpolated = df.groupBy('device', 'submodule').apply(interpolate_udf)

Unfortunately, the results are not as expected. It seems to be due to the groupBy operation not reshuffling the data correctly.

However, when I repartition the dataframe before applying the grouping and interpolation, the results are as expected:

df = spark.read.table(time_series_table)
# explicit shuffle on the grouping keys before the grouped-map UDF
df_partitioned = df.repartition('device', 'submodule')
df_interpolated = df_partitioned.groupBy('device', 'submodule').apply(interpolate_udf)

I have read through a lot of threads and documentation to understand the groupBy operation in PySpark. According to various sources, it is supposed to correctly reshuffle the data on a multi-node cluster on Databricks. At least for me, this apparently is not the case. Am I misunderstanding the behavior of groupBy?

Thanks in advance.
