I have a largish dataframe (5.5M rows, four columns). The first column (let's call it column A) has 235 distinct entries. The second column (B) has 100 distinct entries, integers from 0 to 99, all present in various proportions for each entry in A. I group by A and aggregate by randomly selecting a value from B. Something like
df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
)
In doing so, every value in A is attributed a random integer. My goal is to select, from the dataframe, the columns C and D corresponding to the (A, B) pairs so generated.
My approach so far is
import functools
import operator
import numpy as np
import polars as pl

choice = df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
).to_numpy()
lchoice = ((df["A"] == arr[0]) & (df["B"] == arr[1]) for arr in choice)
mask = functools.reduce(operator.or_, lchoice)
sdf = df[mask].select(["C", "D"])
It does the job, but does not feel very idiomatic.
My first attempt was
sdf = df.filter(
    functools.reduce(
        operator.or_,
        [(pl.col("A") == arr[0]) & (pl.col("B") == arr[1]) for arr in choice],
    )
)
but it hangs until I kill it (I waited for ~30 minutes, while the other approach takes 1.6 seconds).
df.filter(
    (pl.col("A") == choice[0, 0]) & (pl.col("B") == choice[0, 1])
)
works fine, as expected, and I have successfully used the functools.reduce construct as an argument to filter in the past. Obviously, I do not want to write all 235 conditions by hand; I could loop over the rows of choice, filter df one pair at a time, and then concatenate the resulting dataframes (sketched below), but that sounds much more expensive than it should be.
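For reference, a minimal sketch of that loop-and-concat variant, reusing the choice array from above (my own illustration, not benchmarked):

parts = [
    df.filter((pl.col("A") == a) & (pl.col("B") == b)).select(["C", "D"])
    for a, b in choice
]
sdf = pl.concat(parts)  # one small filter per (A, B) pair, stitched back together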
Any tip on getting to my sdf "the polars way", without having to create temporary objects, arrays, etc.? As I said, I have a working solution, but it is kind of shaky, and I am interested in learning better polars.
EDIT: some mock data
df = pl.DataFrame({"A": [1.3, 8.9, 6.7]*3 + [3.6, 4.1]*2,
                   "B": [1]*3 + [2]*3 + [3]*3 + [1]*2 + [2]*2,
                   "C": [21.5, 24.3, 21.8, 20.8, 23.6, 15.6, 23.5,
                         16.1, 15.6, 14.8, 14.7, 23.8, 20.],
                   "D": [6.9, 7.6, 6.4, 6.2, 7.6, 6.2, 6.3,
                         7.1, 7.8, 7.7, 6.5, 6.6, 7.1]})
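For concreteness, the intermediate choice array of the working solution, evaluated on this mock data, has one row per distinct value of A (the drawn B values differ from run to run):

choice = df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
).to_numpy()
# shape (5, 2): column 0 holds the distinct A values (1.3, 8.9, 6.7,
# 3.6, 4.1), column 1 a B drawn from that A's unique values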
Slight twist on the accepted answer:
seed = 17  # any fixed value, kept identical across runs

df.sort(by=['A', 'B'], in_place=True)
sdf = (df
       .join(df
             .groupby('A', maintain_order=True)
             .agg(pl.col('B')
                  .unique()
                  .sort()
                  .shuffle(seed)
                  .first()
                  .alias('B')),
             on=['A', 'B'])
       .select(['C', 'D']))
I need to perform this operation multiple times, and I'd like to ensure reproducibility of the random generation, hence the sorts and maintain_order=True. A reusable wrapper along these lines is sketched below.
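Since the draw is repeated with different seeds, one natural packaging is a small helper; this is my own sketch (the name pick_cd and the example seed are hypothetical), everything else mirrors the snippet above:

def pick_cd(frame: pl.DataFrame, seed: int) -> pl.DataFrame:
    # one reproducible (A, B) pair per distinct A, then the matching C/D rows
    frame = frame.sort(by=["A", "B"])
    picks = frame.groupby("A", maintain_order=True).agg(
        pl.col("B").unique().sort().shuffle(seed).first().alias("B")
    )
    return frame.join(picks, on=["A", "B"]).select(["C", "D"])

sdf = pick_cd(df, seed=17)  # same seed -> same selection; new seed, new draw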