I have a PySpark function called fillnulls that handles null values in my dataset by filling them with appropriate values based on the column type. However, I've noticed that the function's performance is not optimal, especially when dealing with large datasets. I believe there might be better approaches or optimizations to make it more efficient.

Here's a brief overview of the fillnulls function:

from py4j.protocol import Py4JJavaError
from pyspark.sql import functions as F, types as T
from pyspark.sql.window import Window


def fillnulls(df):
    for colname in df.columns:
        dtype = df.schema[colname].dataType
        colname_new = colname + "_filled"
        if isinstance(dtype, T.StringType):
            try:
                # count occurrences of each value in the column
                df = df.withColumn("count", F.count(colname).over(Window.partitionBy(colname)))

                # get the most frequent value (the value with the largest count)
                df = df.withColumn(
                    "mode",
                    F.array_max(
                        F.collect_list(F.struct(F.col("count"), F.col(colname)))
                         .over(Window.partitionBy(F.lit(1)))
                    )[colname])

                # fill nulls with the mode, bucket rare values into "other",
                # and keep everything else as-is
                df = df.withColumn(colname_new,
                                   F.when(F.col(colname).isNull(), F.col("mode"))
                                    .when(F.col("count") < 100, F.lit("other"))
                                    .otherwise(F.col(colname)))

                # drop helper columns and restore the original column name
                df = df.drop(colname, "count", "mode")
                df = df.withColumnRenamed(colname_new, colname)

            except Py4JJavaError:
                # if there are no values at all in the column
                df = df.fillna({colname: 'null'})
        else:
            try:
                # fill nulls with the column mean
                df = df.withColumn("mean", F.mean(colname).over(Window.partitionBy(F.lit(1))))
                df = df.withColumn(colname_new,
                                   F.when(F.col(colname).isNull(), F.col("mean")).otherwise(F.col(colname)))

                # drop the helper column and restore the original column name
                df = df.drop(colname, "mean")
                df = df.withColumnRenamed(colname_new, colname)
            except Py4JJavaError:
                # if there are no values at all in the column
                df = df.fillna({colname: -1})
    return df

The function handles string columns differently from numeric columns. For string columns, it fills nulls with the most frequent value (the mode) and replaces values that appear fewer than 100 times with "other". For numeric columns, it fills nulls with the mean.
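For reference, here is a minimal sketch of how the function is meant to be used; the toy DataFrame and its column names are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy data: a string column with nulls and a numeric column with nulls
df = spark.createDataFrame(
    [("a", 1.0), ("a", None), (None, 3.0), ("b", None)],
    ["category", "value"],
)

# intended behaviour: string nulls -> most frequent value, rare string
# values -> "other", numeric nulls -> the column mean (2.0 here)
fillnulls(df).show()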

However, this function seems to be performing poorly on large datasets. It takes a long time to process, and I suspect there might be more efficient ways to achieve the same result.

Could you please review the function and suggest any improvements or optimizations that can make it faster and more efficient?

Additionally, if you have any best practices or alternative approaches for handling null values in PySpark, I'd be grateful to learn about them.

Thank you in advance for your help!

Shaked Nave

1 Answer


A lot of your calculations can be handled by df.describe(). With this in mind, we can build a map for df.fillna and return that:

from pyspark.sql import functions as F

def get_na_map(df):
    filler = {}

    # will calculate the mean, mode, max, min, and count
    # for you, so start here
    stats = df.describe()

    # iterate over the dtypes rather than a lookup
    # on schema every time
    for c, type_ in df.dtypes:
        # this happens to me on boolean columns sometimes
        if c not in stats.columns:
            print(f'skipping {c} because no stats were calculated')
            continue

        # grab each column's stats that you want
        # I needed the select because the agg didn't
        # work the way I wanted
        col_stats = (
            stats
            .groupBy('summary', c)
            .pivot('summary')
            .agg(F.max(c))
            .select(*(F.max(col).alias(f'col_{col}') for col in ['count', 'max', 'mean']))
            .head(1)
        )[0]

        # count should always be an int
        count = int(col_stats.col_count)
        mode = col_stats.col_max
        mean = col_stats.col_mean
        
        # handle the no records issue here
        # for string
        if type_ == 'string' and not count:
            filler[c] = 'null'
        elif type_ == 'string' and count < 100:
            filler[c] = 'other'
        elif type_ == 'string' and count >= 100:
            filler[c] = mode
        elif count:
            filler[c] = float(mean) if mean is not None else mean
        else:
            # no records case also handled here
            filler[c] = -1

    return filler


na_vals = get_na_map(df)

# see what your fill_values are
print(na_vals)

# apply the map here
df.fillna(na_vals).show()
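
As a side note on why df.describe() is a good starting point: it computes these summary statistics for every column at once and hands them back as a small DataFrame keyed by a summary column. On a hypothetical two-column frame (data and values shown only for illustration), it looks roughly like this:

# assuming an existing SparkSession named `spark`
df = spark.createDataFrame(
    [("a", 1.0), ("a", None), (None, 3.0), ("b", None)],
    ["category", "value"],
)

df.describe().show()
# +-------+--------+------------------+
# |summary|category|             value|
# +-------+--------+------------------+
# |  count|       3|                 2|
# |   mean|    null|               2.0|
# | stddev|    null|1.4142135623730951|
# |    min|       a|               1.0|
# |    max|       b|               3.0|
# +-------+--------+------------------+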
C.Nivs
  • Thank you for your solution, but I have a couple of concerns: The "max" field from df.describe() for string columns doesn't provide the mode, which is what I need. It seems to return the highest character value instead. My intention was to handle null values by replacing them with the mode of each column. Additionally, I wanted to replace values occurring fewer than 100 times in a column with "other." However, the current function replaces the entire column with "other" if there are fewer than 100 rows in the entire DF, which is not the behavior I had in mind. Could you please address these? – Shaked Nave Jul 23 '23 at 07:32
  • 1
    Ah, gotcha. I'll make an edit – C.Nivs Jul 23 '23 at 17:33
  • any chance you could help with that? – Shaked Nave Jul 25 '23 at 14:11