I would like to calculate the row-wise standard deviation across the columns 'foo' and 'bar' of a DataFrame.

I am able to find the min, max, and mean, but not the std.

import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
        "ham": ["a", "b", "c"],
    }
)

# Finding the sum works for me; the same code works for min and max as well.

df = df.select(pl.col('*'),\
        df.select(pl.col(['foo','bar']))\
            .sum(axis=1)\
            .apply(lambda x:round(x,2))\
            .alias('sum'))


However, the code below throws an error when calculating the standard deviation, because the std function has no axis argument.

df = df.select(pl.col('*'),\
        df.select(pl.col(['foo','bar']))\
            .std(axis=1)\
            .apply(lambda x:round(x,2))\
            .alias('std'))

Is there a better way to compute the standard deviation in this scenario?

1 Answer

In Polars, axis=1 operations are covered under "row-wise computations".

See also: https://stackoverflow.com/a/71951543

df = pl.from_repr("""
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
│ 4   ┆ 9   ┆ a   │
└─────┴─────┴─────┘
""")
df.with_columns(
   sum = pl.concat_list("foo", "bar").arr.eval(pl.element().sum()).arr.first(),
   std = pl.concat_list("foo", "bar").arr.eval(pl.element().std()).arr.first()
)
shape: (4, 5)
┌─────┬─────┬─────┬─────┬──────────┐
│ foo ┆ bar ┆ ham ┆ sum ┆ std      │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ str ┆ i64 ┆ f64      │
╞═════╪═════╪═════╪═════╪══════════╡
│ 1   ┆ 6   ┆ a   ┆ 7   ┆ 3.535534 │
│ 2   ┆ 7   ┆ b   ┆ 9   ┆ 3.535534 │
│ 3   ┆ 8   ┆ c   ┆ 11  ┆ 3.535534 │
│ 4   ┆ 9   ┆ a   ┆ 13  ┆ 3.535534 │
└─────┴─────┴─────┴─────┴──────────┘
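The std column is identical in every row because bar is always foo + 5, and std() computes the sample standard deviation (ddof=1) by default. As a quick sanity check, here is a minimal sketch using only Python's standard library; the rows list is simply the foo/bar pairs copied from the frame above, not part of the original answer.

```python
import statistics

# (foo, bar) pairs copied by hand from the example frame
rows = [(1, 6), (2, 7), (3, 8), (4, 9)]

# statistics.stdev is the sample standard deviation (divides by n - 1),
# which matches the default ddof=1 of std() in Polars
row_std = [round(statistics.stdev(r), 6) for r in rows]
print(row_std)  # [3.535534, 3.535534, 3.535534, 3.535534]
```

For two values a and b, the sample standard deviation reduces to |a - b| / sqrt(2), so 5 / sqrt(2) ≈ 3.535534 for each row.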

Summing is also available via .arr.sum() and pl.sum():

pl.concat_list("foo", "bar").arr.sum()

pl.sum(["foo", "bar"])
jqurious