I would like to sample at most n rows from each group in the data, where the grouping is defined by a single column. There are many answers for selecting the top n rows, but I don't need any particular order, and I'm not sure whether ordering would introduce unnecessary shuffling.
I have looked at
- sampleBy(), but it takes a fraction, whereas I need a maximum absolute number of rows.
- Window functions, but they always seem to imply ordering the values.
- groupBy, but I was not able to construct anything suitable from the available aggregate functions (my closest attempt is sketched after this list).
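For illustration, the closest I got with groupBy is collecting each group into an array and truncating it, roughly like this (n, the slice trick, and the variable names are just my sketch; df is the DataFrame from the code example below):

from pyspark.sql import functions as F

n = 1  # maximum number of rows to keep per group
capped = (df.groupBy('field_2')
            .agg(F.slice(F.collect_list('field_1'), 1, n).alias('field_1'))  # keep the first n collected values
            .withColumn('field_1', F.explode('field_1')))  # turn the arrays back into rows

This dodges any explicit ordering (collect_list gives no guaranteed order, which is fine here), but it materialises each full group as an array before truncating, so I'm not convinced it is actually better.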
Code example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [('A', 1), ('B', 1), ('C', 2)]
columns = ["field_1", "field_2"]
df = spark.createDataFrame(data=data, schema=columns)
What I am looking for is a pandas-like
df.groupby('field_2').head(1)
I would also be happy with a suitable SQL expression.
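For concreteness, the only SQL expression I could come up with still orders within each partition (the view name t and the row_number construction are just my sketch):

df.createOrReplaceTempView('t')  # 't' is a made-up view name
spark.sql("""
    SELECT field_1, field_2
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY field_2 ORDER BY field_1) AS rn
        FROM t
    ) AS sub
    WHERE rn <= 1
""")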
Otherwise, if nothing performs better than
Window.partitionBy(df['field_2']).orderBy('field_1')...
then I'd also be happy to know that (the full version I have in mind is sketched below). Thanks!
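For reference, the DataFrame-API version of that fallback, written out (n is just my placeholder for the per-group cap):

from pyspark.sql import Window, functions as F

n = 1  # maximum number of rows per group
w = Window.partitionBy(df['field_2']).orderBy('field_1')
result = (df.withColumn('rn', F.row_number().over(w))  # number the rows within each group
            .filter(F.col('rn') <= n)                  # keep at most n rows per group
            .drop('rn'))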