
I have a dataframe that looks something like this:

df = pl.DataFrame({"group" : ["foo", "bar", "baz"],
                   "elements" : [
                                 pl.arange(0, 100, eager=True), 
                                 pl.arange(200, 300, eager=True), 
                                 pl.arange(300, 400, eager=True)
                                ],
                   "weight": [0.1, 0.5, 0.4]})

print(df)
┌───────┬───────────────────┬────────┐
│ group ┆ elements          ┆ weight │
│ ---   ┆ ---               ┆ ---    │
│ str   ┆ list[i64]         ┆ f64    │
╞═══════╪═══════════════════╪════════╡
│ foo   ┆ [0, 1, … 99]      ┆ 0.1    │
│ bar   ┆ [200, 201, … 299] ┆ 0.5    │
│ baz   ┆ [300, 301, … 399] ┆ 0.4    │
└───────┴───────────────────┴────────┘

How would I sample e.g. 5 elements from each of the lists in the elements column, such that my dataframe looks something like this?

┌───────┬───────────────────────┬────────┐
│ group ┆ elements              ┆ weight │
│ ---   ┆ ---                   ┆ ---    │
│ str   ┆ list[i64]             ┆ f64    │
╞═══════╪═══════════════════════╪════════╡
│ foo   ┆ [7,42,19,74,33]       ┆ 0.1    │
│ bar   ┆ [209,277,222,291,260] ┆ 0.5    │
│ baz   ┆ [300,347,312,398,369] ┆ 0.4    │
└───────┴───────────────────────┴────────┘

If I then wanted to sample a total of 1000 elements from across all groups, weighted according to the weight column, how would I go about doing that?

I've seen this question: "Sample from each group in polars dataframe?", which I think is probably similar, but so far I haven't been able to come up with the combination of expressions that will work.

Theolodus
  • For [1] you can [`.eval`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.arr.eval.html#polars.Expr.arr.eval) e.g. `df.with_columns(pl.col("elements").arr.eval(pl.element().sample(5)))` – jqurious Apr 21 '23 at 16:18

2 Answers


It is easiest to explode the lists so everything is in one working column; then you can apply "regular" expressions to it, like sample, and implode the result back into a list:

df.with_columns(pl.col('elements').arr.explode().sample(5, seed=0).implode().over('group'))

(a seed is given here so the output of this example is reproducible):

shape: (3, 3)
┌───────┬───────────────────┬────────┐
│ group ┆ elements          ┆ weight │
│ ---   ┆ ---               ┆ ---    │
│ str   ┆ list[i64]         ┆ f64    │
╞═══════╪═══════════════════╪════════╡
│ foo   ┆ [29, 1, … 10]     ┆ 0.1    │
│ bar   ┆ [229, 201, … 210] ┆ 0.5    │
│ baz   ┆ [329, 301, … 310] ┆ 0.4    │
└───────┴───────────────────┴────────┘

As for part 2: with the exploded column, and without an `over`, you can get an overall sample like so:

pl.col('elements').arr.explode().sample(1000)
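
For example, wrapped in a `select` (a minimal sketch; it samples 100 here since the example `df` only holds 300 elements in total):

df.select(pl.col('elements').arr.explode().sample(100, seed=0))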

But since the sample method takes no weights, I don't think this can be done in pure Polars. You could feed an exploded DataFrame into `random.choices`:

import random

dfe = df.explode('elements')
random.choices(dfe.get_column('elements'), weights=dfe.get_column('weight'), k=1000)

Wayoshi
  • It looks like polars isn't randomizing each group but just doing one randomization and broadcasting that to the subsequent groups. See how each row of `elements` starts as 29, 229, 329 respectively. – Dean MacGregor Apr 21 '23 at 20:12
  • Well that's frustrating! Although I can see the logic of why that happens. – Wayoshi Apr 21 '23 at 20:49

It seems that when Polars does a sample or shuffle by group, it doesn't randomize each group; it just broadcasts the randomization from one group to the others (at least as of v0.17.6). I think this is a feature, as I imagine most datasets are such that the elements go together, although I'm not sure.
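
One quick way to see this (a sketch; it assumes the example `df` from the question and the v0.17-era `.arr` namespace used elsewhere on this page):

out = df.with_columns(
    pl.col('elements').arr.explode().sample(5, seed=0).implode().over('group')
)
# if the same permutation is broadcast to every group, the first sampled element of each
# row sits at the same offset within its list (e.g. 29, 229, 329 as in the other answer)
print(out.select('group', pl.col('elements').arr.get(0).alias('first_sampled')))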

If you do a `partition_by` and a `concat`, you can force it. Another complication is the dynamic sizing you're looking for, since `sample` only takes an int.

Your example df doesn't have 1000 elements to choose from, but if you wanted a total of 100, it might look like this:

n = 100
(
    pl.concat([
        dd.lazy().with_columns(
            pl.col('elements').arr.explode().shuffle().implode())
        for dd in df.partition_by('group')
    ])
    .with_columns(
        elements=pl.col('elements').arr.take(pl.arange(0, pl.col('weight') * n))
    )
    .collect()
)

If you look at the innermost part of this, you see that it's exploding/shuffling/imploding the elements. That operation is wrapped in a list comprehension whose source is a partition of the original df by group; because of that, the shuffle is forced to be independent for each group. The last line takes the now-shuffled lists of elements and keeps the first weight * n elements of each. I don't know whether making each of the inner frames lazy actually helps in practice, but it adds the potential for parallelism.

Another way to do it is to use numpy for randomness and then filter.

import numpy as np

dfe = df.explode('elements')
(
    dfe
    .with_columns(
        rand=pl.lit(np.random.uniform(0, 1, dfe.height))
    )
    .sort(['group', 'rand'])
    .with_columns(j=pl.col('rand').cumcount().over('group'))  # cumcount has to be applied to a column; choose one arbitrarily
    .filter(pl.col('j') < n * pl.col('weight'))
    .groupby('group')
    .agg(
        pl.col('elements'),
        pl.col('weight').first()
    )
)

In this approach we explode the df, saving it to a new variable since we need the new height to make the random column. We then sort by the random column and create a new j column that represents the in-group index. We then filter so that the in-group index is less than the threshold, and finally groupby/agg to get back to the original shape.
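
If you want to sanity-check the sizes, one option (a sketch; `out` is just a hypothetical name for the result of the block above) is to look at how long each sampled list is:

out.select(
    'group',
    pl.col('elements').arr.lengths().alias('n_sampled'),  # should come out near n * weight for each group
    'weight',
)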

Dean MacGregor