I'm trying to do something really simple which somehow translates into something really difficult when PySpark is involved.
I have a really large dataframe (~2B rows) on our platform, which I'm not allowed to download but only analyse with PySpark code. The dataframe contains the positions of some objects over Europe during the last year, and I want to compute the density of those objects over time. I've successfully used numpy.histogram2d in the past with good results (it's the fastest option I've found in numpy, at least). Since there is no equivalent of this function in PySpark, I've defined a UDF to compute the density and return a new dataframe. This works when I only process a few rows (I've tried with 100K rows):
import pandas as pd
import numpy as np

def compute_density(df):
    # fixed grid over Europe: np.linspace gives 100 edges, i.e. 99 bins per axis
    lon_bins = np.linspace(-15, 45, 100)
    lat_bins = np.linspace(35, 70, 100)
    density, xedges, yedges = np.histogram2d(
        df["corrected_latitude_degree"].values,
        df["corrected_longitude_degree"].values,
        [lat_bins, lon_bins],
    )
    # flatten the 2D histogram into one row per grid cell, labelled by the
    # cell's lower edges; indexing="ij" keeps the edge arrays aligned with
    # density.ravel() (the default "xy" ordering would transpose them)
    x2d, y2d = np.meshgrid(xedges[:-1], yedges[:-1], indexing="ij")
    data = {
        "latitude": x2d.ravel(),
        "longitude": y2d.ravel(),
        "density": density.ravel(),
    }
    return pd.DataFrame(data)
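To sanity-check the helper locally, I run it on a small synthetic pandas frame (the coordinates below are made up; only the column names match the real data):

rng = np.random.default_rng(42)
sample = pd.DataFrame({
    "corrected_latitude_degree": rng.uniform(35, 70, 1000),
    "corrected_longitude_degree": rng.uniform(-15, 45, 1000),
})
out = compute_density(sample)
# all 1000 synthetic points fall inside the grid, so the counts add up
assert out["density"].sum() == 1000
assert len(out) == 99 * 99  # one row per grid cell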
On Spark, I then call this function like this:
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("density", DoubleType()),
])
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

# put everything into a single group so the GROUPED_MAP UDF sees all rows
preproc = (
    inp
    .limit(100000)
    .withColumn("groups", F.lit(0))
)

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def compute_density_udf(df):
    return compute_density(df)

result = preproc.groupby(["groups"]).apply(compute_density_udf)
Why am I using the GROUPED_MAP version to apply the UDF? I didn't manage to get it to work with a SCALAR-type UDF when returning with a schema, although I don't really need to group.
When I try to use this UDF on the full dataset, I get an OOM, because, I believe, there is only one group and it's too much data for the UDF to process. I'm sure there is a smarter way to compute this directly with PySpark, without a UDF, or alternatively to split into groups and then assemble the results at the end. Does anyone have any idea/suggestion?
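For reference, this is the kind of UDF-free approach I have in mind, sketched under the assumption that the fixed-width grid above is fine: compute a bin index per point with plain column arithmetic and let a groupBy do the counting. The variable names mirror the ones above, and nothing here has been tested on the real data:

from pyspark.sql import functions as F

# same bounding box as above; 100 edges from np.linspace means 99 bins per axis
lat_min, lat_max = 35.0, 70.0
lon_min, lon_max = -15.0, 45.0
n_bins = 99
lat_w = (lat_max - lat_min) / n_bins
lon_w = (lon_max - lon_min) / n_bins

density = (
    inp
    .withColumn("lat_bin", F.floor((F.col("corrected_latitude_degree") - lat_min) / lat_w))
    .withColumn("lon_bin", F.floor((F.col("corrected_longitude_degree") - lon_min) / lon_w))
    # drop points outside the grid (edge handling differs slightly from np.histogram2d)
    .where(F.col("lat_bin").between(0, n_bins - 1) & F.col("lon_bin").between(0, n_bins - 1))
    .groupBy("lat_bin", "lon_bin")
    .count()
    # label each cell by its lower edges, like the meshgrid over xedges[:-1] above
    .withColumn("latitude", F.col("lat_bin") * lat_w + lat_min)
    .withColumn("longitude", F.col("lon_bin") * lon_w + lon_min)
    .select("latitude", "longitude", F.col("count").cast("double").alias("density"))
)

Alternatively, since 2D histograms are additive, I imagine I could keep the GROUPED_MAP UDF but feed it many small random shards instead of one giant group, summing the partial densities at the end (again untested; the shard count is picked arbitrarily):

preproc = inp.withColumn("groups", F.floor(F.rand() * 200))  # ~200 random shards
partial = preproc.groupby(["groups"]).apply(compute_density_udf)
result = partial.groupBy("latitude", "longitude").agg(F.sum("density").alias("density"))

That way each UDF invocation would only have to hold roughly 1/200th of the rows in a pandas DataFrame.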