Suppose I have this DataFrame:
id | value |
---|---|
A | 1 |
A | 2 |
A | 3 |
B | 1 |
B | 2 |
C | 1 |
D | 1 |
D | 2 |
and so on. Basically, I want to make sure that, even with a records-per-file limit, any given id appears in only one file (assume the number of rows with that id is less than the limit).
Say I am writing the output as CSV with a record limit:
df.write.option("maxRecordsPerFile", 4).csv(path)
What happens is that id B may end up in two different CSV files, which I want to avoid.
Is there a way to guarantee this? Thanks.
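
To make the requirement concrete, here is a minimal plain-Python sketch (not Spark code) of the grouping I have in mind: ids are packed greedily into file groups so that no group exceeds the limit and no id is split across groups. The function name `assign_file_groups` is hypothetical and exists only for illustration.

```python
from collections import Counter


def assign_file_groups(ids, limit):
    """Map each id to a file-group number so that every group holds at most
    `limit` rows and all rows of one id land in the same group.

    Hypothetical helper for illustration only; greedy, in first-seen order.
    """
    counts = Counter(ids)  # rows per id, keys kept in first-seen order
    groups = {}
    current_group, used = 0, 0
    for key, n in counts.items():
        if n > limit:
            raise ValueError(f"id {key!r} has {n} rows, more than the limit {limit}")
        if used + n > limit:
            # This id does not fit in the current group: start a new one.
            current_group += 1
            used = 0
        groups[key] = current_group
        used += n
    return groups


ids = ["A", "A", "A", "B", "B", "C", "D", "D"]
groups = assign_file_groups(ids, limit=4)
print(groups)  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

With the sample data, B and C share a group while A and D each get their own, so B is never split. I assume that in Spark such a group number could be attached as a column and used with `repartition` and `partitionBy` before writing, but I have not verified that this gives exactly one file per group.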