This runs on a single core, even though it (seemingly) uses nothing outside of Polars. What am I doing wrong?

(The goal is to convert the list in the doc_ids field of every row into its string representation, so that [1, 2, 3] (list[int]) becomes '[1, 2, 3]' (string).)

import polars as pl


df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
    .with_column(
        pl.concat_str([
            pl.lit('['),
            pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
            pl.lit(']')
        ])
        .alias('docs_str')
    )
    .drop('doc_ids')
).collect()
Tim

1 Answer

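In general, we want to avoid apply whenever an expression-based alternative exists. The lambda passed to apply is a black box to Polars: it is called as ordinary Python code, so Polars cannot optimize or parallelize it, which leads to single-threaded performance.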

Here's one way to eliminate apply: replace it with arr.eval. arr.eval lets us treat each list as if it were an Expression/Series, so we can run standard expressions on its elements (referenced through pl.element()).

(
    df.lazy()
    .with_column(
        pl.concat_str(
            [
                pl.lit("["),
                pl.col("doc_ids")
                .arr.eval(pl.element().cast(pl.Utf8))
                .arr.join(", "),
                pl.lit("]"),
            ]
        ).alias("docs_str")
    )
    .drop("doc_ids")
    .collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ ---      │
│ str ┆ str      │
╞═════╪══════════╡
│ a   ┆ [2, 3]   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ [3]      │
└─────┴──────────┘