7

I'm looking for a function along the lines of

df.groupby('column').agg(sample(10))

so that I can take ten or so randomly-selected elements from each group.

This is specifically so I can read in a LazyFrame and work with a small sample of each group as opposed to the entire dataframe.

Update:

One approximate solution is:

df = lf.groupby('column').agg(
        pl.all().sample(.001)
    )
df = df.explode(df.columns[1:])

Update 2

That approximate solution just samples a fixed fraction of each group, which is the same as sampling the whole dataframe and doing a groupby after, rather than taking a fixed ten rows per group. No good.

3 Answers

8

Let's start with some dummy data:

n = 100
seed = 0
df = pl.DataFrame(
    {
        "groups": (pl.int_range(0, n, eager=True) % 5).shuffle(seed=seed),
        "values": pl.int_range(0, n, eager=True).shuffle(seed=seed)
    }
)
df
shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 55     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0      ┆ 40     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 57     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 99     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 87     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 96     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ 43     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 44     │
└────────┴────────┘

This gives us 5 groups of 100 / 5 = 20 elements each. Let's verify that:

df.groupby("groups").agg(pl.count())
shape: (5, 2)
┌────────┬───────┐
│ groups ┆ count │
│ ---    ┆ ---   │
│ i64    ┆ u32   │
╞════════╪═══════╡
│ 1      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0      ┆ 20    │
└────────┴───────┘

Sample our data

Now we are going to use a window function to take a sample of our data.

df.filter(
    pl.int_range(0, pl.count()).shuffle().over("groups") < 10
)
shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 85     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0      ┆ 0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 84     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 19     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 87     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 96     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ 43     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 44     │
└────────┴────────┘

For every group in over("groups"), the pl.int_range(0, pl.count()) expression creates a row index. We then shuffle that range so that we take a sample and not a slice. Then we keep only the index values that are lower than 10. This creates a boolean mask that we can pass to the filter method.
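
Since the question mentions a LazyFrame, the same trick works lazily too. A minimal sketch, assuming a Parquet source and the question's column name "column" (both placeholders), with a seed so the sample is reproducible:

import polars as pl

lf = pl.scan_parquet("data.parquet")   # hypothetical lazy source
sampled = (
    lf.filter(
        # shuffled per-group row index; keep only rows with index < 10
        pl.int_range(0, pl.count()).shuffle(seed=0).over("column") < 10
    )
    .collect()
)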

pedrosaurio
ritchie46
1

A solution using a lambda:

df = (
    lf.groupby('column')
    .apply(lambda x: x.sample(10))
)
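
Note that sample(10) will fail for a group holding fewer than 10 rows; a hedged variant of the same idea (the min(...) guard is my addition, not part of the original answer):

df = (
    lf.groupby('column')
    .apply(lambda x: x.sample(min(10, x.height)))   # guard groups smaller than 10
)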

pedrosaurio
0

We can try making our own groupby-like functionality and sampling from the filtered subsets.

samples = []
cats = df.get_column('column').unique().to_list()
# one full filter pass over the dataframe per category
for cat in cats:
    samples.append(df.filter(pl.col('column') == cat).sample(10))
samples = pl.concat(samples)

Found partition_by in the documentation; this should be more efficient, since at least the groups are made with the API in a single pass over the dataframe. Sampling each group is still linear, unfortunately.

pl.concat([x.sample(10) for x in df.partition_by(groups="column")])

Third attempt, sampling indices:

import numpy as np
import random

# agg_groups() collects the row indices belonging to each group into a list
indices = df.groupby("group").agg(pl.col("value").agg_groups()).get_column("value").to_list()
# draw 10 random row indices from every group and flatten them into one array
sampled = np.array([random.sample(x, 10) for x in indices]).flatten()
df[sampled]
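
Here too, random.sample(x, 10) fails for groups with fewer than 10 rows. A hedged variant of the same index-sampling idea (the min(...) guard and the flat Python list in place of np.flatten are my assumptions, not part of the original answer):

import random
import polars as pl

indices = (
    df.groupby("group")
    .agg(pl.col("value").agg_groups())
    .get_column("value")
    .to_list()
)
# draw up to 10 row indices per group, flattened into one plain list
sampled = [i for grp in indices for i in random.sample(grp, min(10, len(grp)))]
df[sampled]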
BeRT2me
  • Not a bad idea. Will give it a try. –  Jun 15 '22 at 18:03
  • 1
    The issue is, you have to go through the whole dataframe every time you filter. If you have hundreds or thousands of categories, you pay a huge price for this. –  Jun 15 '22 at 18:04
  • It seems like you can use `df.groupby("group").agg(pl.col("value").agg_groups())` to get indexes from each group all in one operation. Then it remains to sample from those and gather all the groups (and it of course returns too many indexes). – creanion Jun 15 '22 at 18:14
  • Updated my answer with what I think should be a slightly more efficient method using [partition_by](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.partition_by.html#polars-dataframe-partition-by) – BeRT2me Jun 15 '22 at 18:15
  • @creanion that's right, but it just kicks the can down to sampling from one of the lists now. –  Jun 15 '22 at 18:21
  • Sampling a list of indexes and then only gathering the rows selected for the sample should be a step up in performance – creanion Jun 15 '22 at 18:23
  • @creanion I agree, let's see if I can work a solution. –  Jun 15 '22 at 18:28
  • I made an attempt at it as well~ – BeRT2me Jun 15 '22 at 18:58