[I managed to answer my own question in a narrow sense, but hopefully someone who knows more than I do can explain why my solution works and give a more general answer.]
I am new to Databricks, Spark, and modern distributed computing. (I do understand parallel processing in general and wrote lower-level concurrent code way back when.)
I've got some Python code that uses pandas. It applies a function over a grouped dataframe to get a series of results indexed by group. I'd like to parallelize it using Databricks with as little pandas-removing surgery as possible.
I was hoping the pandas API on Spark would be what I needed, but I don't know how to distribute function application when the function returns a scalar. Here is a simplified example:
# from quick start online...https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession
# pandas dataframe to try groupby.apply on
pdf = pd.DataFrame({'g': ['a', 'a', 'b', 'b', 'c', 'c'], 'v': [0, 1, 2, 3, 4, 5]})
# converted to pandas on spark api dataframe
psdf = ps.from_pandas(pdf)
# function to apply (that doesn't return a dataframe)
import time
def takes_time(group):
    time.sleep(1)
    return str(group.g.unique()) + ", " + str(pd.Timestamp.now())
# this seems to run sequentially instead of in parallel
psdf.groupby('g').apply(takes_time)
I can see from the timestamps that my little one-second wait shows up between the groups, i.e., the calls ran one after another:
Out[14]: g
b ['b'], 2022-08-04 00:16:02.543390
c ['c'], 2022-08-04 00:16:03.545876
a ['a'], 2022-08-04 00:16:01.540277
dtype: object
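For reference, plain pandas on the same data gives me exactly the shape I'm after, a Series of results indexed by group:
# plain pandas version: applying a scalar-returning function per group
# gives back a Series indexed by the group labels a, b, c
pdf.groupby('g').apply(takes_time)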
I can get it to run in parallel with Spark RDD operations, but then I lose my index:
# seems to run in parallel as expected (but the result is just a list with no index)
# sc is the SparkContext the Databricks notebook already provides
groups_in_list = [group for (name, group) in pdf.groupby('g')]
sc.parallelize(groups_in_list).map(takes_time).collect()
At any rate the output at least looks like it ran in parallel:
Out[15]: ["['a'], 2022-08-04 00:16:16.724457",
"['b'], 2022-08-04 00:16:16.789707",
"['c'], 2022-08-04 00:16:16.712051"]
(I do note that the pandas API on Spark docs say groupby.apply must be passed a function that returns a DataFrame, whereas the pandas docs also allow a Series or scalar return.)
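Given that, I imagine a DataFrame-returning version would look something like the sketch below (takes_time_df and the 'result' column name are just placeholders of mine), though I don't know whether that alone is enough to make it run in parallel:
def takes_time_df(group):
    time.sleep(1)
    out = str(group.g.unique()) + ", " + str(pd.Timestamp.now())
    # wrap the scalar in a one-row pandas DataFrame so each group returns a DataFrame
    return pd.DataFrame({'result': [out]})

psdf.groupby('g').apply(takes_time_df)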
I thought I might get cute and try to map the function over a pyspark.pandas Series, but ps.Series(groups_in_list) fails with ArrowInvalid: Could not convert g v
I think I must be missing something really basic. Can someone set me straight about the right way to do this?