I am trying to apply a simple `value_counts()` to multiple columns of a DataFrame in polars, but I am getting an error.

import polars as pl
import pandas as pd

data:

sample_df = pl.DataFrame({
    'sub-category': ['tv','mobile','tv','wm','micro','wm'],
    'category': ['electronics','mobile','electronics','electronics','kitchen','electronics']
})

Failed Attempts:

#1
sample_df.apply(value_counts())

#2
sample_df.apply(lambda x: x.value_counts())

#3
sample_df.apply(lambda x: x.to_series().value_counts())

#4
sample_df.select(pl.col(['sub-category','category'])).apply(lambda x: x.value_counts())

#5
sample_df.select(pl.col(['sub-category','category'])).apply(lambda x: x.to_series().value_counts())

But if I convert it to a pandas DataFrame, it works:

sample_df.to_pandas().apply(lambda x: x.value_counts())
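
For reference, that pandas call yields one count column per input column, indexed by the union of all values, with NaN where a value does not occur (a sketch from the sample data, not verbatim output - the row order pandas produces may differ):

             sub-category  category
electronics           NaN       4.0
kitchen               NaN       1.0
micro                 1.0       NaN
mobile                1.0       1.0
tv                    2.0       NaN
wm                    2.0       NaN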
ViSa
  • What output are you trying to achieve? Is it an individual count per column? Something like: `sample_df.melt().select(pl.struct(pl.all()).value_counts())` – jqurious May 05 '23 at 11:37
  • @jqurious I am trying to get individual counts for each column, like this `pandas` code gives me: `sample_df.to_pandas().apply(lambda x: x.value_counts())` – ViSa May 05 '23 at 12:07

4 Answers


You could `.melt()` + `.groupby().count()`:

df.melt(variable_name="column").groupby(pl.all()).count()
shape: (7, 3)
┌──────────────┬─────────────┬───────┐
│ column       ┆ value       ┆ count │
│ ---          ┆ ---         ┆ ---   │
│ str          ┆ str         ┆ u32   │
╞══════════════╪═════════════╪═══════╡
│ sub-category ┆ mobile      ┆ 1     │
│ category     ┆ kitchen     ┆ 1     │
│ sub-category ┆ wm          ┆ 2     │
│ sub-category ┆ tv          ┆ 2     │
│ sub-category ┆ micro       ┆ 1     │
│ category     ┆ mobile      ┆ 1     │
│ category     ┆ electronics ┆ 4     │
└──────────────┴─────────────┴───────┘

You can add .pivot for a long-to-wide reshape.

df.melt(variable_name="column").groupby(pl.all()).count().pivot(
   values = "count",
   index = "value",
   columns = "column",
   aggregate_function = None
)
shape: (6, 3)
┌─────────────┬──────────────┬──────────┐
│ value       ┆ sub-category ┆ category │
│ ---         ┆ ---          ┆ ---      │
│ str         ┆ u32          ┆ u32      │
╞═════════════╪══════════════╪══════════╡
│ mobile      ┆ 1            ┆ 1        │
│ wm          ┆ 2            ┆ null     │
│ micro       ┆ 1            ┆ null     │
│ kitchen     ┆ null         ┆ 1        │
│ electronics ┆ null         ┆ 4        │
│ tv          ┆ 2            ┆ null     │
└─────────────┴──────────────┴──────────┘
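
(Side note: more recent polars releases have renamed these methods; a rough equivalent, assuming polars >= 1.0 where `melt`/`groupby` became `unpivot`/`group_by`, would be the following - check against your installed version.)

# sketch for polars >= 1.0; GroupBy.count() became GroupBy.len(),
# which returns the counts in a column named "len"
df.unpivot(variable_name="column").group_by(pl.all()).len()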
jqurious
  • I edited the pivot parameter order so that it matches the order that you'd use if you weren't using named arguments, i.e. `pivot('counts','value','variable','first')` – Dean MacGregor May 05 '23 at 13:26

If you want the same result as in the pandas case, you have to do a small workaround, which I think pandas does implicitly since it is based on an index.

You create a value count for each column and then merge them all with an outer join.

def multiple_column_value_counts(df, columns, value_column="values"):
    value_counts_dfs = []
    for column in columns:
        counts = df.get_column(column).value_counts()

        # rename to a shared value column plus a per-column count column
        name_map = dict(zip(counts.columns, [value_column, column]))
        value_counts_dfs.append(counts.rename(name_map))

    # outer-join all the per-column counts on the shared value column
    merge_df = value_counts_dfs.pop(0)
    for value_counts in value_counts_dfs:
        merge_df = merge_df.join(value_counts, on=value_column, how="outer")
    return merge_df


columns = ['category', 'sub-category']

multiple_column_value_counts(sample_df, columns)



shape: (6, 3)
┌─────────────┬──────────┬──────────────┐
│ values      ┆ category ┆ sub-category │
│ ---         ┆ ---      ┆ ---          │
│ str         ┆ u32      ┆ u32          │
╞═════════════╪══════════╪══════════════╡
│ micro       ┆ null     ┆ 1            │
│ mobile      ┆ 1        ┆ 1            │
│ wm          ┆ null     ┆ 2            │
│ tv          ┆ null     ┆ 2            │
│ kitchen     ┆ 1        ┆ null         │
│ electronics ┆ 4        ┆ null         │
└─────────────┴──────────┴──────────────┘
alexp
  • thanks @alexp for sharing the solution, but this is a quite large & complex way in `polars` to do something that `pandas` can do in just 1 line of code: `sample_df.to_pandas()[['sub-category','category']].apply(lambda x: x.value_counts())`. And I agree with you, but will still wait to see if there is any better way to do it. – ViSa May 05 '23 at 12:17

TL;DR: one-line answer at the bottom.

I think it's worth pointing out that polars is not built on top of pandas, nor does it have any intent of sharing pandas's API and syntax. Additionally, while pandas makes extensive use of `df.apply`, in polars that is almost always suboptimal if not just wrong.

To your question: polars does have a `value_counts` method, although it returns a struct, which is another thing that is very different from pandas.

Let's say we have

df=pl.from_repr("""shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 1   ┆ 2   │
│ 2   ┆ 1   │
│ 2   ┆ 2   │
│ 3   ┆ 1   │
│ 3   ┆ 2   │
└─────┴─────┘""")

then if you just wanted to go one column at a time you could do

df.select(pl.col('a').value_counts()).unnest('a')
shape: (3, 2)
┌─────┬────────┐
│ a   ┆ counts │
│ --- ┆ ---    │
│ i64 ┆ u32    │
╞═════╪════════╡
│ 1   ┆ 2      │
│ 3   ┆ 2      │
│ 2   ┆ 2      │
└─────┴────────┘
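
To see the struct mentioned above, drop the `.unnest` (a quick sketch; note the count field is named `counts` in the polars version used in this thread and `count` in newer releases):

df.select(pl.col('a').value_counts())
# one column named "a" with dtype struct[2],
# holding {value, count} pairs such as {1, 2}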

There's not a great way of making that work for all columns. You could do:

f = lambda x: (
    df.select(pl.col(x).value_counts())
      .unnest(x)
      .rename({x: "value", "counts": x})
      .lazy()
)
final = f(df.columns[-1])
for x in df.columns[:-1][::-1]:
    final = f(x).join(final, on="value", how="outer")
final.collect()
shape: (3, 3)
┌───────┬─────┬──────┐
│ value ┆ a   ┆ b    │
│ ---   ┆ --- ┆ ---  │
│ i64   ┆ u32 ┆ u32  │
╞═══════╪═════╪══════╡
│ 1     ┆ 2   ┆ 3    │
│ 2     ┆ 2   ┆ 3    │
│ 3     ┆ 2   ┆ null │
└───────┴─────┴──────┘

It's definitely awkward, although this is probably more performant than the next approach, which is much more succinct to type:

df.melt().groupby('variable','value').count().pivot('count','value','variable','first')
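
The same one-liner spelled out with keyword arguments, for readability (matching the positional order noted in the comment on the first answer):

(
    df.melt()
      .groupby('variable', 'value')
      .count()
      .pivot(values='count', index='value', columns='variable', aggregate_function='first')
)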
Dean MacGregor
  • Thanks for explaining single-column `value_counts`, the struct `unnest`, and how to work with the whole dataframe. Appreciate your guidance on the same :) – ViSa May 06 '23 at 08:27

I was led to believe it is generally best to avoid `apply` and to use standard Python, such as list comprehensions, to loop over columns - particularly if the frame is very large. E.g. my solution for the above would be something like:

counts = pl.concat([
    sample_df.select(pl.col(col).value_counts())
        .unnest(col)
        .rename({col: "value"})
        .with_columns(column=pl.lit(col))
    for col in sample_df.columns
])
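
From there, a sketch of reshaping that long result into the wide per-column layout shown in the other answers (same `pivot` API as used in this thread):

counts.pivot(
    values="counts",
    index="value",
    columns="column",
    aggregate_function=None,
)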