Polars: fill nulls with the only vaild value within each group

Question

Each group only has one valid or not_null value in a random row. How do you fill each group with that value?

import polars as pl

data = {
    'group': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'col1': [1, None, None, None, 3, None, None, None, 5],
    'col2': ['a', None, None, None, 'b', None, None, None, 'c'],
    'col3': [False, None, None, None, True, None, None, None, False]
}
df = pl.DataFrame(data)

shape: (9, 4)
┌───────┬──────┬──────┬───────┐
│ group ┆ col1 ┆ col2 ┆ col3  │
│ ---   ┆ ---  ┆ ---  ┆ ---   │
│ str   ┆ i64  ┆ str  ┆ bool  │
╞═══════╪══════╪══════╪═══════╡
│ 1     ┆ 1    ┆ a    ┆ false │
│ 1     ┆ null ┆ null ┆ null  │
│ 1     ┆ null ┆ null ┆ null  │
│ 2     ┆ null ┆ null ┆ null  │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 2     ┆ null ┆ null ┆ null  │
│ 3     ┆ null ┆ null ┆ null  │
│ 3     ┆ null ┆ null ┆ null  │
│ 3     ┆ 5    ┆ c    ┆ false │
└───────┴──────┴──────┴───────┘

Desired output:

shape: (9, 4)
┌───────┬──────┬──────┬───────┐
│ group ┆ col1 ┆ col2 ┆ col3  │
│ ---   ┆ ---  ┆ ---  ┆ ---   │
│ str   ┆ i64  ┆ str  ┆ bool  │
╞═══════╪══════╪══════╪═══════╡
│ 1     ┆ 1    ┆ a    ┆ false │
│ 1     ┆ 1    ┆ a    ┆ false │
│ 1     ┆ 1    ┆ a    ┆ false │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 3     ┆ 5    ┆ c    ┆ false │
│ 3     ┆ 5    ┆ c    ┆ false │
│ 3     ┆ 5    ┆ c    ┆ false │
└───────┴──────┴──────┴───────┘

In pandas, I can do the following for each column

import pandas as pd

df = pd.DataFrame(data)
df.col1 = df.groupby('group').col.apply(lambda x: x.ffill().bfill())

How do you do this in polars, ideally with a window function (.over()) ?

Hi @steven, a window operation with min should do it: pl.min(‘col’).over(‘group’) — Luca, Mar 23 '23 at 00:25

Dean MacGregor · Accepted Answer · 2023-03-24T16:25:51.357

The immediate way to do exactly what you asked is (and it looks the most like your pandas approach):

df.with_columns(pl.exclude('group').forward_fill().backward_fill().over('group'))

using pl.all() instead of pl.exclude('group') also works but it'll save some theoretical time by not making it look through the group column for the fills.

If there's a list of columns you want to do this to (as opposed to all but group) then you can replace the pl.exclude with a generator or list comprehension

cols=['col1','col2','col3']
df.with_columns(pl.col(x).forward_fill().backward_fill().over('group') for x in cols)

You can even use regex in pl.col as long as you use the ^ and $ anchor.

df.with_columns(pl.col("^col\d$").forward_fill().backward_fill().over('group'))

Another approach besides forward/backward fills:

df.with_columns(pl.col("^col\d$").drop_nulls().first().over('group'))

If the first looks a little weird it's because the drop_nulls is going to return a different number of rows than the original df which will cause an error. If the expression is an aggregation (like sum, min, max, etc) then it doesn't complain about getting a different number of rows and, instead, just propagates that answer to all the rows. In this case first is the aggregation which just means the first thing it sees. Since the filter is only returning one thing we just need a way to tell it to propagate that.

The different column selection tricks work in this approach too but I'll spare the reader the extra copy/paste

Final note:

if your next step is to take unique then you should just do it as a df.groupby to start with

df \
    .groupby('group', maintain_order=True) \
    .agg(pl.col("^col\d$").drop_nulls().first())

score 1 · Answer 2 · answered Mar 23 '23 at 00:18

One option is to first extract the non-null rows in each group, and then join those values back into the table.

import polars as pl

data = pl.DataFrame{
    'group': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'col': [1, None, None, None, 3, None, None, None, 5]
})

data_non_null = (
    data
    .filter(pl.col('col').is_not_null())
    .select('group', col_non_null='col')
)

data = (
    data
    .join(data_non_null, on='group')
    .select('group', col='col_non_null')
)

This is an application of a very general pattern that should also work in Pandas, SQL, the various R data frame libraries, etc.

Note that usually you need some kind of grouping operation to apply this pattern. Here however we know/assume that there is exactly one non-null value in each group, so we just have to filter for non-null values.

This indeed works, but i wonder if there is a window function way... — steven, Mar 23 '23 at 00:26

Polars: fill nulls with the only vaild value within each group

2 Answers2