I'm trying to do something really simple which somehow translates into something really difficult when PySpark is involved.
I have a really large dataframe (~2B rows) on our platform, which I'm not allowed to download but only analyse with PySpark code. The dataframe contains the positions of some objects over Europe during the last year, and I want to compute the density of those objects over time. I've successfully used numpy.histogram2d in the past with good results (it's the fastest option I've found in numpy, at least). Since there is no equivalent of this function in PySpark, I've defined a UDF to compute the density and return a new dataframe. This works when I only process a few rows (I've tried with 100K rows):
import pandas as pd
import numpy as np

def compute_density(df):
    # fixed grid over Europe: np.linspace gives 100 edges, i.e. 99 bins per axis
    lon_bins = np.linspace(-15, 45, 100)
    lat_bins = np.linspace(35, 70, 100)
    density, xedges, yedges = np.histogram2d(
        df["corrected_latitude_degree"].values,
        df["corrected_longitude_degree"].values,
        [lat_bins, lon_bins],
    )
    # flatten the 2D histogram into one row per grid cell, labelled by the
    # cell's lower edges; indexing="ij" keeps the edge arrays aligned with
    # density.ravel() (the default "xy" ordering would transpose them)
    x2d, y2d = np.meshgrid(xedges[:-1], yedges[:-1], indexing="ij")
    data = {
        "latitude": x2d.ravel(),
        "longitude": y2d.ravel(),
        "density": density.ravel(),
    }
    return pd.DataFrame(data)
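To sanity-check the helper locally, I run it on a small synthetic pandas frame (the coordinates below are made up; only the column names match the real data):

rng = np.random.default_rng(42)
sample = pd.DataFrame({
    "corrected_latitude_degree": rng.uniform(35, 70, 1000),
    "corrected_longitude_degree": rng.uniform(-15, 45, 1000),
})
out = compute_density(sample)
# all 1000 synthetic points fall inside the grid, so the counts add up
assert out["density"].sum() == 1000
assert len(out) == 99 * 99  # one row per grid cell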
On Spark, I then call this function like this:
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("density", DoubleType()),
])
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

# put everything into a single group so the GROUPED_MAP UDF sees all rows
preproc = (
    inp
    .limit(100000)
    .withColumn("groups", F.lit(0))
)

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def compute_density_udf(df):
    return compute_density(df)

result = preproc.groupby(["groups"]).apply(compute_density_udf)
Why am I using the GROUPED_MAP version to apply the UDF? I didn't manage to get it to work with a SCALAR-type UDF when returning with a schema, although I don't really need to group.
When I try to use this UDF on the full dataset, I get an OOM, because, I believe, there is only one group and it's too much data for the UDF to process. I'm sure there is a smarter way to compute this directly with PySpark, without a UDF, or alternatively to split into groups and then assemble the results at the end. Does anyone have any idea/suggestion?
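For reference, this is the kind of UDF-free approach I have in mind, sketched under the assumption that the fixed-width grid above is fine: compute a bin index per point with plain column arithmetic and let a groupBy do the counting. The variable names mirror the ones above, and nothing here has been tested on the real data:

from pyspark.sql import functions as F

# same bounding box as above; 100 edges from np.linspace means 99 bins per axis
lat_min, lat_max = 35.0, 70.0
lon_min, lon_max = -15.0, 45.0
n_bins = 99
lat_w = (lat_max - lat_min) / n_bins
lon_w = (lon_max - lon_min) / n_bins

density = (
    inp
    .withColumn("lat_bin", F.floor((F.col("corrected_latitude_degree") - lat_min) / lat_w))
    .withColumn("lon_bin", F.floor((F.col("corrected_longitude_degree") - lon_min) / lon_w))
    # drop points outside the grid (edge handling differs slightly from np.histogram2d)
    .where(F.col("lat_bin").between(0, n_bins - 1) & F.col("lon_bin").between(0, n_bins - 1))
    .groupBy("lat_bin", "lon_bin")
    .count()
    # label each cell by its lower edges, like the meshgrid over xedges[:-1] above
    .withColumn("latitude", F.col("lat_bin") * lat_w + lat_min)
    .withColumn("longitude", F.col("lon_bin") * lon_w + lon_min)
    .select("latitude", "longitude", F.col("count").cast("double").alias("density"))
)

Alternatively, since 2D histograms are additive, I imagine I could keep the GROUPED_MAP UDF but feed it many small random shards instead of one giant group, summing the partial densities at the end (again untested; the shard count is picked arbitrarily):

preproc = inp.withColumn("groups", F.floor(F.rand() * 200))  # ~200 random shards
partial = preproc.groupby(["groups"]).apply(compute_density_udf)
result = partial.groupBy("latitude", "longitude").agg(F.sum("density").alias("density"))

That way each UDF invocation would only have to hold roughly 1/200th of the rows in a pandas DataFrame.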