I have written a Spark job which performs the operations below:
1. Read data from HDFS text files.
2. Call distinct() to filter duplicates.
3. Run a mapToPair phase to generate a pair RDD.
4. Call reduceByKey.
5. Apply the aggregation logic to each grouped tuple.
6. Call foreach on the result of #5.
Inside the foreach, it does the following:
- make a call to the Cassandra DB
- create an AWS SNS and SQS client connection
- do some JSON record formatting
- publish the record to SNS/SQS
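
To make the steps above concrete, here is a minimal sketch of the per-partition body in plain Java, without Spark or the AWS SDK on the classpath. Everything named here is an assumption for illustration: `StubPublisher` is a hypothetical stand-in for the SNS/SQS client, the JSON field names `key` and `total` are made up, and the Cassandra call is left out. The point is only the shape: one client per partition, then format and publish for each record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the SNS/SQS client; this is NOT the AWS SDK API.
class StubPublisher {
    static int connectionsOpened = 0;                 // counts simulated connections
    final List<String> published = new ArrayList<>(); // records "sent" so far

    StubPublisher() {
        connectionsOpened++;                          // in the real job this is the costly step
    }

    void publish(String message) {
        published.add(message);
    }
}

class PartitionPublishSketch {
    // Formats one aggregated (key, total) pair as a JSON record.
    // Naive on purpose: it does not escape special characters in the key.
    static String toJson(String key, long total) {
        return "{\"key\":\"" + key + "\",\"total\":" + total + "}";
    }

    // Roughly what would be passed to foreachPartition:
    // the client is created once, then reused for every record in the partition.
    static StubPublisher publishPartition(Iterator<String[]> records) {
        StubPublisher client = new StubPublisher();
        while (records.hasNext()) {
            String[] record = records.next();
            client.publish(toJson(record[0], Long.parseLong(record[1])));
        }
        return client;
    }

    public static void main(String[] args) {
        List<String[]> partition = Arrays.asList(
                new String[] {"a", "3"},
                new String[] {"b", "5"});
        StubPublisher client = publishPartition(partition.iterator());
        System.out.println(client.published);
        System.out.println("connections opened: " + StubPublisher.connectionsOpened);
    }
}
```

In this shape the number of connections equals the number of partitions, not the number of records.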
When I run this job, it creates three Spark stages:
- first stage: performs the distinct; takes nearly 45 sec
- second stage: mapToPair and reduceByKey; takes 1.5 mins
- third stage: takes 19 mins
What I did:
- I turned off the Cassandra call to see how much the DB hit costs; it accounts for comparatively little time.
- The offending part I found is creating the SNS/SQS connection for each partition; it takes more than 60% of the entire job time.
I am creating the SNS/SQS connection within foreachPartition to reduce the number of connections. Is there an even better way?
I cannot create the connection object on the driver, as these objects are not serializable.
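
One pattern that might apply here is a lazily initialized static holder: a static field exists once per JVM, so each executor would build the client on first use and reuse it across partitions, and nothing has to be serialized from the driver. The sketch below is plain Java under stated assumptions, not a verified fix for this job: `StubClient`, `ClientHolder`, and `ClientHolderSketch` are hypothetical names, and it assumes the real client can be shared safely between the task threads of an executor.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a non-serializable SNS/SQS client.
class StubClient {
    static final AtomicInteger created = new AtomicInteger(); // how many clients were built

    StubClient() {
        created.incrementAndGet();
    }

    void publish(String message) {
        // no-op in this sketch; the real client would send the message
    }
}

// Builds the client on first use and hands out the same instance afterwards.
// The field is static, so there is one client per JVM rather than one per partition.
class ClientHolder {
    private static StubClient client;

    static synchronized StubClient get() {
        if (client == null) {
            client = new StubClient();
        }
        return client;
    }
}

class ClientHolderSketch {
    // Simulates several partitions being processed inside the same JVM
    // and returns how many clients have been created in total.
    static int processPartitions(int partitions, int recordsPerPartition) {
        for (int p = 0; p < partitions; p++) {
            StubClient client = ClientHolder.get(); // would sit inside foreachPartition
            for (int r = 0; r < recordsPerPartition; r++) {
                client.publish("record-" + p + "-" + r);
            }
        }
        return StubClient.created.get();
    }

    public static void main(String[] args) {
        System.out.println("clients created: " + processPartitions(4, 3));
    }
}
```

With this shape the connection count is tied to the number of executor JVMs instead of the number of partitions. Whether that removes the bottleneck depends on how expensive client creation really is in the job.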
I am using: number of executors 9, executor cores 15, driver memory 2g, executor memory 5g.
The cluster is an EMR deployment running Spark 1.6: 1 master and 9 slaves, all with the same configuration (16 cores, 64 GB memory).