Stratified Sampling of a DataFrame

Question

Given a dataframe with columns "a", "b", and "value", I'd like to sample N rows from each pair of ("a", "b"). In Python with pandas, this is easy to do with the following syntax:

import pandas as pd
df.groupby(["a", "b"]).sample(n=10)

In Julia, I found a way to achieve something similar using:

using DataFrames, StatsBase

combine(groupby(df, [:a, :b]),
names(df) .=> sample .=> names(df)
)

However, I don't know how to extend this to n>1. I tried

combine(groupby(df, [:a, :b]),
names(df) .=> x -> sample(x, n) .=> names(df)
)

but this returned the error (for n=3):

DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 7")

One method (with slightly different syntax) that I found was:

combine(groupby(df, [:a, :b]), x -> x[sample(1:nrow(x), n), :])

but I'd like to know whether there are better alternatives.

The last example you gave is a normal way to do it. Note that in this case it is enough to use `rand` instead of `sample`. `sample` would be needed if you wanted sampling without replacement. — Bogumił Kamiński, Jun 12 '21 at 18:48
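
To illustrate the distinction in the comment above, here is a minimal sketch of per-group sampling without replacement, which is what `groupby(...).sample(n=...)` does by default in pandas. The small data frame and its values are made up for illustration, and the sketch assumes every group has at least `n` rows (otherwise `sample` with `replace = false` throws an error).

```julia
using DataFrames, StatsBase

# Hypothetical toy data: four (a, b) groups with three rows each.
df = DataFrame(a = repeat(1:2, inner = 6),
               b = repeat(1:2, inner = 3, outer = 2),
               value = 1:12)
n = 2

# Draw n distinct row positions within each group;
# replace = false guarantees that no row is picked twice.
sampled = combine(groupby(df, [:a, :b]),
                  x -> x[sample(1:nrow(x), n; replace = false), :])
```

With the default `replace = true` (or with `rand`), the same row can appear more than once within a group.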

score 3 · Accepted Answer · answered Jun 12 '21 at 19:27

As an additional comment: if your data frame has an id column (holding the row number), then:

df[combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id, :]

will be a bit faster (but not by much).

Here is an example:

using DataFrames
n = 10
df = DataFrame(a=rand(1:1000, 10^8), b=rand(1:1000, 10^8), id=1:10^8)
combine(groupby(df, [:a, :b]), x -> x[rand(1:nrow(x), n), :]); # around 16.5 seconds on my laptop
df[combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id, :]; # around 14 seconds on my laptop
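
If the data frame does not have an id column yet, one can be added first. The following is only a small sketch of the same id-based approach on made-up data; the column names `id` and `n_rows` are chosen for illustration, and the final step simply checks that each group contributed `n` rows.

```julia
using DataFrames

# Hypothetical small data frame without an id column.
df = DataFrame(a = rand(1:3, 1_000), b = rand(1:3, 1_000), value = rand(1_000))
df.id = collect(1:nrow(df))   # row-number column (assumed not to exist already)
n = 10

# Sample n ids per (a, b) group (with replacement), then index the parent once.
ids = combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id
sampled = df[ids, :]

# Sanity check: count the sampled rows per group.
counts = combine(groupby(sampled, [:a, :b]), nrow => :n_rows)
```

Because `rand` samples with replacement, the same id can show up more than once in `sampled`.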
