I'll keep to things that work in lazy mode, as it appears that you are working in lazy mode with Parquet files.
Let's use this data as an example:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"col_int": [-2, -2, 0, 2, 2],
"col_float": [-20.0, -10, 10, 20, 20],
"col_date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 5, 1), "1mo"),
"col_str": ["str1", "str2", "", None, "str5"],
"col_bool": [True, False, False, True, False],
}
).lazy()
df.collect()
shape: (5, 5)
┌─────────┬───────────┬─────────────────────┬─────────┬──────────┐
│ col_int ┆ col_float ┆ col_date ┆ col_str ┆ col_bool │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ str ┆ bool │
╞═════════╪═══════════╪═════════════════════╪═════════╪══════════╡
│ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 ┆ str1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ -2 ┆ -10.0 ┆ 2020-02-01 00:00:00 ┆ str2 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 10.0 ┆ 2020-03-01 00:00:00 ┆ ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-04-01 00:00:00 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ str5 ┆ false │
└─────────┴───────────┴─────────────────────┴─────────┴──────────┘
Using the col
Expression
One feature of the col
expression is that you can supply a datatype, or even a list of datatypes. For example, if we want to contain our queries to floats, we can do the following:
df.select((pl.col(pl.Float64) > 0).sum().suffix("__positive_count_")).collect()
shape: (1, 1)
┌────────────────────────────┐
│ col_float__positive_count_ │
│ --- │
│ u32 │
╞════════════════════════════╡
│ 3 │
└────────────────────────────┘
(Note: (pl.col(...) > 0)
creates a series of boolean values that need to be summed, not counted)
To include more than one datatype, you can supply a list of datatypes to col
.
df.select(
(pl.col([pl.Int64, pl.Float64]) > 0).sum().suffix("__positive_count_")
).collect()
shape: (1, 2)
┌──────────────────────────┬────────────────────────────┐
│ col_int__positive_count_ ┆ col_float__positive_count_ │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞══════════════════════════╪════════════════════════════╡
│ 2 ┆ 3 │
└──────────────────────────┴────────────────────────────┘
You can also combine these into the same select statement if you'd like.
df.select(
[
(pl.col(pl.Utf8).str.lengths() == 0).sum().suffix("__empty_count"),
pl.col(pl.Utf8).is_null().sum().suffix("__null_count"),
(pl.col([pl.Float64, pl.Int64]) > 0).sum().suffix("_positive_count"),
]
).collect()
shape: (1, 4)
┌──────────────────────┬─────────────────────┬──────────────────────────┬────────────────────────┐
│ col_str__empty_count ┆ col_str__null_count ┆ col_float_positive_count ┆ col_int_positive_count │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════════════════╪═════════════════════╪══════════════════════════╪════════════════════════╡
│ 1 ┆ 1 ┆ 3 ┆ 2 │
└──────────────────────┴─────────────────────┴──────────────────────────┴────────────────────────┘
The Cookbook has a handy list of datatypes.
Using the exclude
expression
Another handy trick is to use the exclude
expression. With this, we can select all columns except columns of certain datatypes. For example:
df.select(
[
pl.exclude(pl.Utf8).max().suffix("_max"),
pl.exclude([pl.Utf8, pl.Boolean]).min().suffix("_min"),
]
).collect()
shape: (1, 7)
┌─────────────┬───────────────┬─────────────────────┬──────────────┬─────────────┬───────────────┬─────────────────────┐
│ col_int_max ┆ col_float_max ┆ col_date_max ┆ col_bool_max ┆ col_int_min ┆ col_float_min ┆ col_date_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ u32 ┆ i64 ┆ f64 ┆ datetime[ns] │
╞═════════════╪═══════════════╪═════════════════════╪══════════════╪═════════════╪═══════════════╪═════════════════════╡
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ 1 ┆ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 │
└─────────────┴───────────────┴─────────────────────┴──────────────┴─────────────┴───────────────┴─────────────────────┘
Unique counts
One caution: unique_counts
results in Series of varying lengths.
df.select(pl.col("col_int").unique_counts().prefix(
"__unique_count_")).collect()
shape: (3, 1)
┌────────────────────────┐
│ __unique_count_col_int │
│ --- │
│ u32 │
╞════════════════════════╡
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└────────────────────────┘
df.select(pl.col("col_float").unique_counts().prefix(
"__unique_count_")).collect()
shape: (4, 1)
┌──────────────────────────┐
│ __unique_count_col_float │
│ --- │
│ u32 │
╞══════════════════════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└──────────────────────────┘
As such, these should not be combined into the same results. Each column/Series of a DataFrame must have the same length.