I have data for 75 million e-commerce customer accounts in a CSV file. I also have transaction records in another file; the account number is the primary key linking the two. Each account has about 500 transactions on average.
Now, I want to process this data and make decisions about giving promotional offers. Since the amount of data is very large, I decided to go with SparkSQL.
The problem is that when I join these two tables, there will be a lot of shuffling between cluster nodes, and I want to avoid that shuffling.
To do so, I'd like to ensure that each account's record lands on the same partition as its transaction data. How can I do that?
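To make the co-location I'm after concrete, here is a small plain-Python sketch (toy data, and `partition_for` is just my own illustration, not a Spark API) of hash-partitioning both datasets by the join key so that matching rows always end up in the same partition:

```python
# Minimal illustration of hash co-partitioning: rows that share a join key
# (account number) are routed to the same partition, so the join can be
# done locally per partition, with no cross-node shuffle. Toy data only.

NUM_PARTITIONS = 8

def partition_for(account_number: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic partition assignment based on the join key."""
    return hash(account_number) % num_partitions

# Toy datasets keyed by account number.
accounts = [(acct, f"customer-{acct}") for acct in range(100)]
transactions = [(acct, txn_id, 9.99) for acct in range(100) for txn_id in range(3)]

# Route both datasets through the SAME partitioner.
account_parts = [[] for _ in range(NUM_PARTITIONS)]
txn_parts = [[] for _ in range(NUM_PARTITIONS)]
for row in accounts:
    account_parts[partition_for(row[0])].append(row)
for row in transactions:
    txn_parts[partition_for(row[0])].append(row)

# Each partition can now be joined locally: every transaction's account
# record is guaranteed to sit in the same partition.
for p in range(NUM_PARTITIONS):
    local_accounts = {acct for acct, _ in account_parts[p]}
    assert all(acct in local_accounts for acct, _, _ in txn_parts[p])
```

If I understand Spark correctly, the equivalent would be calling `df.repartition(n, "account_number")` on both DataFrames before the join, or persisting both tables with `DataFrameWriter.bucketBy(n, "account_number")` so the partitioning by the join key survives across jobs. Is that the right approach?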
As a temporary solution, I could split the 75 million accounts into 75 files of 1 million accounts each, split the transactions the same way, and then spin up 75 Spark instances to process the pairs. Is there a better way to do this?