
I need to drop rows that have a NaN value in any column. Null values can be dropped with drop_nulls():

df.drop_nulls()

but there is no equivalent for NaNs. I have found that the method drop_nans exists for Series but not for DataFrames:

df['A'].drop_nans()

The pandas code that I'm using:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [0, 0, 0, 1, None, 1],
        'B': [1, 2, 2, 1, 1, np.nan]
    }
)
df.dropna()
EnesZ
  • Does this answer your question? [polars dropna equivalent on list of columns](https://stackoverflow.com/questions/73971106/polars-dropna-equivalent-on-list-of-columns) – G. Anderson Feb 23 '23 at 17:56
  • No, sorry. I haven't found my answer there. – EnesZ Feb 23 '23 at 18:11

4 Answers


Not sure why it currently only exists as a Series method.

You can use .filter() to emulate the behaviour, then call .drop_nulls():

>>> df.filter(pl.all(pl.col(pl.Float32, pl.Float64).is_not_nan())).drop_nulls()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 0   ┆ 1.0 │
│ 0   ┆ 2.0 │
│ 0   ┆ 2.0 │
│ 1   ┆ 1.0 │
└─────┴─────┘
jqurious
  • What if I don't know the number of columns and their types? Without hardcoded `pl.Float32, pl.Float64` – EnesZ Feb 23 '23 at 18:37
  • `pl.col(pl.Float32, pl.Float64)` selects all float columns. Only float columns can contain `NaN` – jqurious Feb 23 '23 at 18:40
  • Ok, I tried to get a list with all float column names `float_cols` and then use it in `pl.col(float_cols)`. It worked. I will write that answer below. – EnesZ Feb 23 '23 at 19:01
  • 2
    `pl.col(pl.Float32, pl.Float64)` already selects the columns without needing to name them e.g. `pl.DataFrame(dict(A=[1], B=[2.0], C=["hi"], D=[3.0])).select(pl.col(pl.Float32, pl.Float64))` – jqurious Feb 24 '23 at 09:12
  • I see now, you are right. – EnesZ Feb 24 '23 at 13:24

If you have mixed nulls and NaNs, the easiest thing to do is replace the NaNs with nulls and then use drop_nulls():

df.with_columns(pl.col(pl.Float32, pl.Float64).fill_nan(None)).drop_nulls()

From the inside out:

pl.col(pl.Float32, pl.Float64) picks all the columns that are floats and hence able to contain NaN.

fill_nan(None) replaces any NaN value with, in this case, None, which is a proper null.

drop_nulls() does exactly what it sounds like it does.

Dean MacGregor
  • It doesn't work for me with this dataframe `df = pl.DataFrame({'A': [0, 1.0, 1, np.nan, 2],'B': ['1', '1','1','1', None]})` – EnesZ Feb 23 '23 at 19:06
  • @EnesZ I did another edit that simplifies the expression by a lot. – Dean MacGregor Feb 23 '23 at 20:21
  • Yes, it works. The only problem is that the data frame might have hundreds of columns with different types, so hardcoded `pl.Float32, pl.Float64` is not the best solution. Please check my solution where I used a list with float column names. – EnesZ Feb 23 '23 at 22:57
  • 1
    I don't know why you are calling it "hardcoded" as though it's relying on some static column names. Your solution does the same thing, albeit in a more roundabout way having to lookup the column names first. – Dean MacGregor Feb 24 '23 at 00:01
  • 1
    Yes, you are right. I thought that `pl.Float32, pl.Float64` means that we want to select two columns or `pl.Float32, pl.Float64, pl.Float64` for three columns, etc. Now I see that it will select all float columns. – EnesZ Feb 24 '23 at 13:26

As @jqurious suggested, but with explicit column names:

import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        'A': [0, 1.0, 1, np.nan, 2],
        'B': ['1', '1','1','1', None]
    }
)

# get the names of all float-typed columns
float_col = [c for c in df.columns if df[c].dtype in (pl.Float64, pl.Float32)]

df.filter(pl.all(pl.col(float_col).is_not_nan())).drop_nulls()
EnesZ

Try this:

import polars as pl
import numpy as np

# create a DataFrame with some NaN values
df = pl.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': ['foo', 'bar', 'app', 'ctx', 'mpq']
})

df.to_pandas().dropna()
Sunderam Dubey
  • Nice idea, I could use something like this `pl.from_pandas(df.to_pandas().dropna())` if there is no other solution. – EnesZ Feb 23 '23 at 18:15
  • 3
    Don't do this. Round tripping to pandas is not a good solution. For toy data you won't notice but bigger data is going suffer from a ton of unnecessary overhead. – Dean MacGregor Feb 23 '23 at 18:40