Tagging "sql" too, because an answer that derives a column to partition on using Spark SQL would be fine.
Summary: Say I have 3B distinct values of AlmostUID. I don't want 3B partitions; I want about 1000. But I want all records that share the same AlmostUID value to land on the same partition.
Input:
AlmostUID | LoadMonth |
---|---|
1 | April |
1 | May |
2 | April |
3 | June |
4 | June |
4 | August |
5 | September |
Expected:
"GoodPartition" is good because the records with AlmostUID (1) are on the same partition. Records with AlmostUID (4) are on the same partition.
"BadPartition" is bad because AlmostUID (1) is mapped to multiple different partitions.
AlmostUID | LoadMonth | GoodPartition | BadPartition |
---|---|---|---|
1 | April | 1 | 1 |
1 | May | 1 | 2 |
2 | April | 1 | 1 |
3 | June | 2 | 1 |
4 | June | 2 | 2 |
4 | August | 2 | 2 |
5 | September | 2 | 2 |
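
To make the property concrete, here is a minimal plain-Python sketch of what I mean (not Spark code): hash each AlmostUID and take it modulo the partition count. The helper name `bucket_for` and the use of `zlib.crc32` are my own stand-ins for illustration; the actual bucket numbers will differ from the GoodPartition column above, and only the "same key, same bucket" property matters.

```python
import zlib

NUM_PARTITIONS = 1000  # target partition count from the summary


def bucket_for(almost_uid, num_partitions=NUM_PARTITIONS):
    """Map a key to a bucket index in [0, num_partitions).

    crc32 is only a stand-in for a stable hash (assumption); the
    property relies on the hash being deterministic, nothing more.
    """
    digest = zlib.crc32(str(almost_uid).encode("utf-8"))
    return digest % num_partitions


# The sample input from the question.
rows = [
    (1, "April"), (1, "May"), (2, "April"), (3, "June"),
    (4, "June"), (4, "August"), (5, "September"),
]

# Attach a bucket to every record.
assigned = [(uid, month, bucket_for(uid)) for uid, month in rows]

# Collect the set of buckets each AlmostUID ended up in.
buckets_per_uid = {}
for uid, _month, bucket in assigned:
    buckets_per_uid.setdefault(uid, set()).add(bucket)

# "Good" partitioning: every AlmostUID maps to exactly one bucket.
is_good = all(len(buckets) == 1 for buckets in buckets_per_uid.values())
```

As far as I understand, the Spark equivalents would be something like `df.repartition(1000, "AlmostUID")` in PySpark, or a derived column such as `pmod(hash(AlmostUID), 1000)` in Spark SQL. Spark's `hash` is not crc32, so the bucket values would not match this sketch, and I have not confirmed that either form handles skew well at 3B distinct keys.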