0

Given a Polars DataFrame, is there a way to replace a particular value by "null"? For example, if there's a sentinel value like "_UNKNOWN" and I want to make it truly missing in the dataframe instead.

astrojuanlu
  • 6,744
  • 8
  • 45
  • 105

3 Answers3

5

You can use .when().then().otherwise()

pl.col(pl.Utf8) is used to select all "string columns".

df = pl.DataFrame({
   "A": ["a", "b", "_UNKNOWN", "c"], 
   "B": ["_UNKNOWN", "d", "e", "f"], 
   "C": [1, 2, 3, 4]
})

df.with_columns(
   pl.when(pl.col(pl.Utf8) == "_UNKNOWN")
     .then(None)
     .otherwise(pl.col(pl.Utf8)) # keep original value
     .keep_name()
)

shape: (4, 3)
┌──────┬──────┬─────┐
│ A    | B    | C   │
│ ---  | ---  | --- │
│ str  | str  | i64 │
╞══════╪══════╪═════╡
│ a    | null | 1   │
├──────┼──────┼─────┤
│ b    | d    | 2   │
├──────┼──────┼─────┤
│ null | e    | 3   │
├──────┼──────┼─────┤
│ c    | f    | 4   │
└─//───┴─//───┴─//──┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
3

This is really a tweak of @jqurious's answer.

When you do a when and your condition isn't met then the default is null so you can just do:

df.with_columns(
    pl.when(pl.col(pl.Utf8) != "_UNKNOWN")
        .then(pl.col(pl.Utf8)) # keep original value
        .keep_name()
)

If you have multiple null conditions say null_strings=["_UNKNOWN", "UNK", "who_knows"] then you can use a is_in like this:

df.with_columns(
    pl.when(~pl.col(pl.Utf8).is_in(null_strings))
        .then(pl.col(pl.Utf8)) # keep original value
        .keep_name()
)
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
1

In addition to previous answers, if you have just imported dataset, you can specify what values will be null

Example data.csv:

A,B
a,1
b,2
_UNKNOWN,more than 3
c,_UNKNOWN

when importing dataset:

pl.read_csv("data.csv",
            null_values="_UNKNOWN",
            # infer_schema_length=0,  # optional (if you will get parsing error))

df:

┌──────┬─────────────┐
│ A    ┆ B           │
│ ---  ┆ ---         │
│ str  ┆ str         │
╞══════╪═════════════╡
│ a    ┆ 1           │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ 2           │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ more than 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c    ┆ null        │
└──────┴─────────────┘
glebcom
  • 1,131
  • 5
  • 14