
I have a Polars DataFrame with a mix of Series, which I want to write to a CSV / upload to a database.

The problem is that if any of the Utf8 series contain non-ASCII characters, the upload fails due to the DB type I'm using, so I would like to filter out the non-ASCII characters while leaving everything else intact.

I created a function that uses a lambda, which does work, but it is slow compared with standard Polars functions, and I was hoping to replace it with a Polars alternative:

import polars as pl

def df_column_clean(df: pl.DataFrame, drop_non_ascii: bool = False) -> pl.DataFrame:
    """
    Takes a Polars DataFrame and performs data cleaning on all columns.
    Currently it only converts string series to ASCII, but can be expanded in the future.
    """
    if drop_non_ascii:
        df_changes = []
        for col_name, col_type in df.schema.items():
            if col_type != pl.Utf8:
                continue

            # Remove non-ASCII characters
            df_changes.append(
                pl.col(col_name).apply(
                    lambda x: None if x is None else x.encode("ascii", "ignore").decode("ascii"),
                    skip_nulls=False,
                )
            )

        if len(df_changes) > 0:
            return df.with_columns(df_changes)
    return df

Is the method I came up with the best option, or does Polars have an inbuilt function that can be used to filter out non-ASCII characters?

Thanks in advance

Cade

1 Answer


.replace_all() with a regex that matches non-ASCII chars:

pl.col(pl.Utf8).str.replace_all(r"[^\p{Ascii}]", "")
jqurious
  • That's brilliant. I had no idea you could apply a rule on all columns of a certain type with pl.col(pl.Utf8). Thank you so much. – Cade Aug 20 '23 at 14:11
  • For completeness: there's also [`cs.string()`](https://pola-rs.github.io/polars/py-polars/html/reference/selectors.html#polars.selectors.string) in recent Polars versions. – Wayoshi Aug 20 '23 at 15:00