
I need to drop rows that have a NaN value in any column. Null values can be dropped with drop_nulls():

df.drop_nulls()

but there is no equivalent for NaNs. I have found that the method drop_nans exists for Series but not for DataFrames:

df['A'].drop_nans()

The pandas code that I'm using:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [0, 0, 0, 1, None, 1],
        'B': [1, 2, 2, 1, 1, np.nan]
    }
)
df.dropna()
EnesZ
  • Does this answer your question? [polars dropna equivalent on list of columns](https://stackoverflow.com/questions/73971106/polars-dropna-equivalent-on-list-of-columns) – G. Anderson Feb 23 '23 at 17:56
  • No, sorry. I haven't found my answer there. – EnesZ Feb 23 '23 at 18:11

4 Answers


Not sure why it currently only exists as a Series method.

You can use .filter() to emulate the behaviour, then call .drop_nulls():

>>> df.filter(pl.all(pl.col(pl.Float32, pl.Float64).is_not_nan())).drop_nulls()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 0   ┆ 1.0 │
│ 0   ┆ 2.0 │
│ 0   ┆ 2.0 │
│ 1   ┆ 1.0 │
└─────┴─────┘
jqurious
  • What if I don't know the number of columns and their types? Without hardcoded `pl.Float32, pl.Float64` – EnesZ Feb 23 '23 at 18:37
  • `pl.col(pl.Float32, pl.Float64)` selects all float columns. Only float columns can contain `NaN` – jqurious Feb 23 '23 at 18:40
  • Ok, I tried to get a list with all float column names `float_cols` and then use it in `pl.col(float_cols)`. It worked. I will write that answer below. – EnesZ Feb 23 '23 at 19:01
  • 2
    `pl.col(pl.Float32, pl.Float64)` already selects the columns without needing to name them e.g. `pl.DataFrame(dict(A=[1], B=[2.0], C=["hi"], D=[3.0])).select(pl.col(pl.Float32, pl.Float64))` – jqurious Feb 24 '23 at 09:12
  • I see now, you are right. – EnesZ Feb 24 '23 at 13:24

If you have mixed nulls and NaNs, the easiest thing to do is replace the NaNs with nulls and then use drop_nulls():

df.with_columns(pl.col(pl.Float32, pl.Float64).fill_nan(None)).drop_nulls()

From the inside out:

pl.col(pl.Float32, pl.Float64) picks all the columns that are floats and hence able to contain NaN.

fill_nan(None) replaces any NaN value with, in this case, None, which is a proper null.

drop_nulls() does exactly what it sounds like it does.

Dean MacGregor
  • It doesn't work for me with this dataframe `df = pl.DataFrame({'A': [0, 1.0, 1, np.nan, 2],'B': ['1', '1','1','1', None]})` – EnesZ Feb 23 '23 at 19:06
  • @EnesZ I did another edit that simplifies the expression by a lot. – Dean MacGregor Feb 23 '23 at 20:21
  • Yes, it works. The only problem is that the data frame might have hundreds of columns with different types, so hardcoded `pl.Float32, pl.Float64` is not the best solution. Please check my solution where I used a list with float column names. – EnesZ Feb 23 '23 at 22:57
  • 1
    I don't know why you are calling it "hardcoded" as though it's relying on some static column names. Your solution does the same thing, albeit in a more roundabout way having to lookup the column names first. – Dean MacGregor Feb 24 '23 at 00:01
  • 1
    Yes, you are right. I thought that `pl.Float32, pl.Float64` means that we want to select two columns or `pl.Float32, pl.Float64, pl.Float64` for three columns, etc. Now I see that it will select all float columns. – EnesZ Feb 24 '23 at 13:26

As @jqurious suggested, but with explicit column names:

import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        'A': [0, 1.0, 1, np.nan, 2],
        'B': ['1', '1','1','1', None]
    }
)

# get the names of all float-typed columns
float_col = [c for c in df.columns if df[c].dtype in (pl.Float64, pl.Float32)]

df.filter(pl.all(pl.col(float_col).is_not_nan())).drop_nulls()
EnesZ

Try this:

import polars as pl
import numpy as np

# create a DataFrame with some NaN values
df = pl.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': ['foo', 'bar', 'app', 'ctx', 'mpq']
})

df.to_pandas().dropna()
Sunderam Dubey
  • Nice idea, I could use something like this `pl.from_pandas(df.to_pandas().dropna())` if there is no other solution. – EnesZ Feb 23 '23 at 18:15
  • 3
    Don't do this. Round tripping to pandas is not a good solution. For toy data you won't notice but bigger data is going suffer from a ton of unnecessary overhead. – Dean MacGregor Feb 23 '23 at 18:40