Save a dataframe with a records limit while making sure the same value is not spread across multiple files

Question

Suppose I have this dataframe:

id	value
A	1
A	2
A	3
B	1
B	2
C	1
D	1
D	2

And so on. Basically, I want to make sure that, even with a records limit, any given id appears in only one file (assume the number of entries with that id is less than the limit).

Say I am trying to write the output as CSV with a records limit:

    df.write.option("maxRecordsPerFile", 4).csv(path)

It turns out that id B may appear in 2 different CSVs, which I want to avoid.

Is there a way to ensure this? Thanks

Not an exact solution, but probably still helpful: [Link](https://stackoverflow.com/a/37510415/2129801) — werner, Mar 09 '23 at 20:03

score 1 · Accepted Answer · answered Mar 10 '23 at 10:59

You could ensure that all records with the same id end up in the same file by combining `repartition` and `partitionBy`. In that case, you will have one file per id, which respects your constraint.

    df.repartition($"id").write.partitionBy("id").csv(path)

If you want to reduce the number of files, you can simply use `repartition` without `partitionBy`. In that case, records with the same id will still necessarily end up in the same file, but there will be collisions: several different ids may share one file. Note that in that case you cannot really control the maximum size of a file, only the average file size. Let's say that we have n records and that we want an average file size of s records; we could do the following:

    df.repartition(n / s, $"id").write.csv(path)
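As a rough sketch of the `n / s` idea, assuming `n` is obtained with `df.count()` and the column is named `id` as in the question: the names `AverageFileSize`, `numPartitionsFor`, `writeGroupedById` and `targetPerFile` are made up for illustration, and rounding up (rather than plain integer division) is my own choice so that the partition count is never 0.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object AverageFileSize {
  // Hypothetical helper: number of partitions so that each file holds
  // about `targetPerFile` records on average. Rounds up and never
  // returns less than 1, because repartition needs a positive count.
  def numPartitionsFor(totalRecords: Long, targetPerFile: Long): Int = {
    require(targetPerFile > 0, "targetPerFile must be positive")
    math.max(1L, (totalRecords + targetPerFile - 1) / targetPerFile).toInt
  }

  // Hypothetical wrapper around the one-liner above. All rows sharing an
  // id are hashed to the same partition, so they land in the same file.
  def writeGroupedById(df: DataFrame, path: String, targetPerFile: Long): Unit = {
    val n = df.count() // runs an extra job just to count the rows
    df.repartition(numPartitionsFor(n, targetPerFile), col("id"))
      .write
      .csv(path)
  }
}
```

With the 8 rows from the question, `AverageFileSize.writeGroupedById(df, path, 4)` would ask for 2 partitions. This only targets an average: a file can still end up with more than 4 records if several ids hash to the same partition.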

Oli

Thanks, yeah, it seems I need to loosen the constraint and just specify an average file size to avoid the same id being split across different files. In the first example you mentioned there will be collisions; what is that collision? – ForkPork Mar 10 '23 at 17:36
With the first example, there actually won't be collisions (same ids in the same file). The first operation, `repartition`, may generate collisions while making sure that all rows of each given `id` end up in the same partition. `partitionBy` will then create one file per partition and per id, so in the end one file = one id, so no collisions. – Oli Mar 11 '23 at 08:36

Collisions = different ids getting the same hash with `repartition` and thus ending up in the same partition. I made a typo in the comment and cannot edit it anymore – Oli Mar 11 '23 at 10:52
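
To make the collision idea concrete, here is a toy model in plain Scala, not Spark itself. It assumes a partition is picked as `hashCode` modulo the number of partitions, which is only a stand-in: Spark uses its own hash function, so the actual assignment will differ. `CollisionDemo` and `assign` are hypothetical names.

```scala
object CollisionDemo {
  // Toy stand-in for hash partitioning: each id goes to partition
  // (hash mod numPartitions). Spark's real hash function is different.
  def assign(ids: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
    ids.groupBy(id => Math.floorMod(id.hashCode, numPartitions))

  def main(args: Array[String]): Unit = {
    // 4 distinct ids into 2 partitions: some ids must share a partition.
    val assignment = assign(Seq("A", "B", "C", "D"), 2)
    assignment.toSeq.sortBy(_._1).foreach { case (partition, members) =>
      println(s"partition $partition -> ${members.mkString(", ")}")
    }
    // prints:
    // partition 0 -> B, D
    // partition 1 -> A, C
  }
}
```

Each id still lives in exactly one partition, which is the guarantee that matters here; the collision only means a file can contain more than one id.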
