[I managed to answer my own question in a narrow sense, but hopefully someone who knows more than I do can explain why my solution works and give a more general answer.]
I am new to Databricks, Spark, and modern distributed computing. (I do understand parallel processing in general and wrote lower-level concurrent code way back when.)
I've got some Python code that uses pandas. It applies a function over a grouped dataframe to get a series of results indexed by group. I'd like to parallelize it using Databricks with as little pandas-removing surgery as possible.
I was hoping the pandas API on Spark would be what I needed, but I don't know how to distribute function application when the function returns a scalar. Here is a simplified example:
# from quick start online...https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession
# pandas dataframe to try groupby.apply on
pdf = pd.DataFrame({'g': ['a', 'a', 'b', 'b', 'c', 'c'], 'v': [0, 1, 2, 3, 4, 5]})
# converted to pandas on spark api dataframe
psdf = ps.from_pandas(pdf)
# function to apply (that doesn't return a dataframe)
import time
def takes_time(group):
    time.sleep(1)
    return str(group.g.unique()) + ", " + str(pd.Timestamp.now())
# this seems to run sequentially instead of in parallel
psdf.groupby('g').apply(takes_time)
I can see from the timestamps that my little one-second wait shows up between the groups, i.e., the calls ran one after another:
Out[14]: g
b ['b'], 2022-08-04 00:16:02.543390
c ['c'], 2022-08-04 00:16:03.545876
a ['a'], 2022-08-04 00:16:01.540277
dtype: object
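For reference, plain pandas on the same data gives me exactly the shape I'm after, a Series of results indexed by group:
# plain pandas version: applying a scalar-returning function per group
# gives back a Series indexed by the group labels a, b, c
pdf.groupby('g').apply(takes_time)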
I can get it to run in parallel with Spark RDD operations, but then I lose my index:
# seems to run in parallel as expected (but the result is just a list with no index)
# sc is the SparkContext the Databricks notebook already provides
groups_in_list = [group for (name, group) in pdf.groupby('g')]
sc.parallelize(groups_in_list).map(takes_time).collect()
At any rate the output at least looks like it ran in parallel:
Out[15]: ["['a'], 2022-08-04 00:16:16.724457",
"['b'], 2022-08-04 00:16:16.789707",
"['c'], 2022-08-04 00:16:16.712051"]
(I do note that the pandas API on Spark docs say groupby.apply must be passed a function that returns a DataFrame, whereas the pandas docs also allow a Series or scalar return.)
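Given that, I imagine a DataFrame-returning version would look something like the sketch below (takes_time_df and the 'result' column name are just placeholders of mine), though I don't know whether that alone is enough to make it run in parallel:
def takes_time_df(group):
    time.sleep(1)
    out = str(group.g.unique()) + ", " + str(pd.Timestamp.now())
    # wrap the scalar in a one-row pandas DataFrame so each group returns a DataFrame
    return pd.DataFrame({'result': [out]})

psdf.groupby('g').apply(takes_time_df)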
I thought I might get cute and try to map the function over a pyspark.pandas Series, but ps.Series(groups_in_list) fails with ArrowInvalid: Could not convert g v
I think I must be missing something really basic. Can someone set me straight about the right way to do this?