
I have very large datasets in Spark DataFrames that are distributed across the nodes. I can compute simple statistics such as mean, standard deviation, skewness, and kurtosis using `pyspark.sql.functions`.

If I want to use advanced statistical tests like Jarque-Bera (JB) or Shapiro-Wilk (SW), I use Python libraries such as SciPy, since the standard Apache PySpark libraries don't have them. But in order to do that, I have to convert the Spark DataFrame to pandas, which means forcing all the data onto the driver node, like so:

import scipy.stats as stats

# Collects the entire distributed DataFrame onto the driver
pandas_df = spark_df.toPandas()

JBtest = stats.jarque_bera(pandas_df)
SWtest = stats.shapiro(pandas_df)

I have multiple features, and each feature ID corresponds to a dataset on which I want to compute the test statistics.

My question is:

Is there a way to apply these Python functions to a Spark DataFrame while the data is still distributed across the nodes, or do I need to write my own JB/SW test statistic functions in Spark?

Thank you for any valuable insight.

  • Does this answer your question? [Implementing pythonic statistical functions on spark and pandas dataframes interchangebly](https://stackoverflow.com/questions/63862410/implementing-pythonic-statistical-functions-on-spark-and-pandas-dataframes-inter) – werner Sep 13 '20 at 20:17

1 Answer


You should be able to define a vectorized user-defined function that wraps the pandas function (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), like this:

from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats

# The decorator must sit directly above the function it wraps
@pandas_udf('double', PandasUDFType.SCALAR)
def vector_jarque_bera(x):
    return stats.jarque_bera(x)

# test:
spark_df.withColumn('y', vector_jarque_bera(spark_df['x']))

Note that the vectorized function takes a column as its argument and returns a column.

(N.B. The `@pandas_udf` decorator is what transforms the pandas function defined directly below it into a vectorized function. Each element of the returned vector is itself a scalar, which is why the argument `PandasUDFType.SCALAR` is passed.)
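As a shape check: a SCALAR pandas UDF receives a pandas Series and must return a pandas Series of the same length. A minimal sketch of that contract, assuming a numeric column `x` (the mean and standard deviation here are computed per Arrow batch, so this is only to illustrate the length requirement, not a globally correct standardization):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# One output value per input value: the returned Series has len(x)
# elements, which is what SCALAR pandas UDFs require
@pandas_udf('double', PandasUDFType.SCALAR)
def standardize(x):
    return (x - x.mean()) / x.std()

spark_df.withColumn('x_std', standardize(spark_df['x']))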

  • Thank you for this answer. When I tried it, I get the following error: `RuntimeError: Result vector from pandas_udf was not the required length: expected 10000, got 2` Is there a min required length for the pandas_udf? – thentangler Sep 14 '20 at 16:03
  • Ah, sorry... I assumed the `stats.jarque_bera` function returned a pandas Series, but it actually returns two scalars. This is not suitable for vectorization. I think you need to find (or write) a parallelized implementation. – AltShift Sep 18 '20 at 00:17
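A sketch of one such parallelized route, assuming Spark 3.0+ and illustrative column names `feature_id` and `x` (neither appears in the original question): since each feature ID corresponds to its own dataset, the test can be run once per group with `groupBy(...).applyInPandas`, so each group is handled on an executor rather than collected to the driver. Each group still has to fit in a single executor's memory.

import pandas as pd
import scipy.stats as stats

def jb_per_group(pdf):
    # pdf holds all rows for one feature_id as a pandas DataFrame
    stat, pvalue = stats.jarque_bera(pdf['x'])
    return pd.DataFrame({'feature_id': [pdf['feature_id'].iloc[0]],
                         'jb_stat': [stat],
                         'jb_pvalue': [pvalue]})

result = spark_df.groupBy('feature_id').applyInPandas(
    jb_per_group, schema='feature_id long, jb_stat double, jb_pvalue double')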