
I have data for 75 million e-commerce customer accounts in a CSV file.

I also have transaction records in another file. Here, account number is the primary key. Each account has about 500 transactions on average.

Now I want to process this data and make some decisions about giving promotional offers. Since the amount of data is very large, I decided to go for SparkSQL.

But the problem is that when I join these two tables, there will be a lot of shuffling between cluster nodes. I want to avoid this shuffling.

To do that, I'd like to ensure that an account's data is on the same partition as its transaction data. How can I do that?

A temporary solution is to divide the 75 million accounts into 75 files of 1 million accounts each, split their transactions in a similar fashion, and then spin up 75 Spark instances to process them all. Is there any other way to do this?
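The idea behind the manual 75-file split can be sketched in plain Python (toy data, hypothetical account numbers): if both datasets are hashed into the same number of buckets by the same key, matching rows always land in the same bucket, so each bucket can be joined independently without any data crossing between them.

```python
# Toy sketch of hash co-partitioning by account number.
# Both datasets use the same partitioner, so an account's row and all
# of its transactions land in the same bucket.

NUM_BUCKETS = 4

def bucket_of(account_id: int) -> int:
    """Deterministic hash partitioner on the join key."""
    return hash(account_id) % NUM_BUCKETS

accounts = [(101, "alice"), (102, "bob"), (103, "carol")]
transactions = [(101, 50.0), (102, 20.0), (101, 75.0), (103, 10.0)]

account_buckets = {b: [] for b in range(NUM_BUCKETS)}
txn_buckets = {b: [] for b in range(NUM_BUCKETS)}
for acc in accounts:
    account_buckets[bucket_of(acc[0])].append(acc)
for txn in transactions:
    txn_buckets[bucket_of(txn[0])].append(txn)

# Join each bucket pair locally -- no row ever needs to move to
# another bucket, which is what "no shuffle" means at join time.
joined = []
for b in range(NUM_BUCKETS):
    lookup = {acc_id: name for acc_id, name in account_buckets[b]}
    for acc_id, amount in txn_buckets[b]:
        if acc_id in lookup:
            joined.append((acc_id, lookup[acc_id], amount))

print(len(joined))  # 4: every transaction found its account in its own bucket
```

This is essentially what a hash-partitioned join does; the question is how to get Spark to arrange the data this way up front instead of doing it by hand with 75 separate files.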

KrunalParmar

1 Answer


Transactions and account details are different DataFrames, so their rows can't literally live in the same partition.

However, you can use Hive bucketing to reduce shuffling. Save both tables with bucketBy on the account ID (and optionally apply sortBy as well). That way, Spark won't shuffle when you join them.

To better understand Hive bucketing with Spark 2.0, please check this.

Avishek Bhattacharya