
I have written a Spark job which performs the following operations (a rough sketch follows the list below):

  1. Read data from HDFS text files.

  2. Call distinct() to filter out duplicates.

  3. Do a mapToPair phase and generate a pair RDD.

  4. Do a reduceByKey call.

  5. Apply the aggregation logic to each grouped tuple.

  6. Call a foreach on the result of #5, which does the following:

    1. make a call to the Cassandra DB
    2. create an AWS SNS and SQS client connection
    3. do some JSON record formatting
    4. publish the record to SNS/SQS
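
Roughly, the flow looks like the sketch below (Spark 1.6 Java API). The input path, key extraction, and aggregation logic are placeholders, not my real code, and step 6 is shown with foreachPartition, which is what I actually use as described further down:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PublishJobSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("sns-sqs-publish-job");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 1. read HDFS text files (placeholder path)
        JavaRDD<String> lines = sc.textFile("hdfs:///input/path");

        // 2. drop duplicate records
        JavaRDD<String> unique = lines.distinct();

        // 3. key each record (first tab-separated field, purely illustrative)
        JavaPairRDD<String, String> pairs = unique.mapToPair(
                line -> new Tuple2<>(line.split("\t", 2)[0], line));

        // 4./5. reduceByKey with the aggregation logic (string concat as a stand-in)
        JavaPairRDD<String, String> aggregated = pairs.reduceByKey((a, b) -> a + "|" + b);

        // 6. per-partition side effects: Cassandra lookup, JSON formatting, SNS/SQS publish
        aggregated.foreachPartition(partition -> {
            while (partition.hasNext()) {
                Tuple2<String, String> record = partition.next();
                // look up Cassandra, format JSON, publish to SNS/SQS for this record
            }
        });

        sc.stop();
    }
}
```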

When I run this job it creates three Spark stages:

first stage - performs the distinct - takes nearly 45 sec

second stage - mapToPair and reduceByKey - takes 1.5 mins

third stage - takes 19 mins

What I did:

  1. I turned off the Cassandra call to see if the DB hit was the cause - it takes comparatively little time.
  2. The offending part I found is the creation of the SNS/SQS connections for each partition in foreachPartition.

That part is taking more than 60% of the entire job time.

I am creating the SNS/SQS connections within foreachPartition so that fewer connections are created (sketched below). Is there an even better way?

I cannot create the connection objects on the driver, as they are not serializable.
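
For reference, this is roughly what the publish step looks like today: one SNS client and one SQS client per partition, built on the executors. The topic ARN, queue URL, and JSON formatting here are placeholders, and I am using the AWS SDK for Java v1 client builders:

```java
import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;
import com.amazonaws.services.sns.model.PublishRequest;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class PartitionPublisher {
    // Publishes each aggregated (key, value) tuple as JSON to SNS and SQS.
    // The clients are created once per partition, on the executor, because
    // they are not serializable and cannot be shipped from the driver.
    public static void publish(JavaPairRDD<String, String> aggregated,
                               final String topicArn, final String queueUrl) {
        aggregated.foreachPartition(partition -> {
            AmazonSNS sns = AmazonSNSClientBuilder.defaultClient();
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            try {
                while (partition.hasNext()) {
                    Tuple2<String, String> record = partition.next();
                    // placeholder JSON formatting
                    String json = "{\"key\":\"" + record._1() + "\",\"value\":\"" + record._2() + "\"}";
                    sns.publish(new PublishRequest(topicArn, json));
                    sqs.sendMessage(new SendMessageRequest(queueUrl, json));
                }
            } finally {
                sns.shutdown();
                sqs.shutdown();
            }
        });
    }
}
```

Because the RDD has many partitions, this still ends up building many client pairs, which is the part that dominates the runtime.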

I am now using: number of executors 9, executor cores 15, driver memory 2g, executor memory 5g.

I am using a cluster of 16-core, 64 GB memory nodes (1 master and 9 slaves, all with the same configuration) on an EMR deployment with Spark 1.6.

Sam
  • Are you sure it is `create an aws SNS and SQS client connection` that is taking 60% of the job time, and not `publish the record to SNS/SQS`? There is a slight difference between the two: in the first case you need to minimize the number of connections created, whereas in the second case you need to distribute your data (and create more connection instances). Interesting! – code Feb 03 '17 at 07:52
  • If it is the second case, I'll post an answer with a solution. – code Feb 03 '17 at 07:53

1 Answer


It sounds like you would want to set up exactly one SNS/SQS connection per node and then use it to process all of your data on each node.

I think foreachPartition is the right idea here, but you might want to coalesce your RDD beforehand. This will collapse partitions on the same node without shuffling, and will allow you to avoid starting extra SNS/SQS connections.

See here: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@coalesce(numPartitions:Int,shuffle:Boolean,partitionCoalescer:Option[org.apache.spark.rdd.PartitionCoalescer])(implicitord:Ordering[T]):org.apache.spark.rdd.RDD[T]
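
A minimal sketch of that, assuming a JavaPairRDD<String, String> like the one in the question and roughly one partition per worker node (the node count is an assumption, adjust it to your cluster):

```java
import org.apache.spark.api.java.JavaPairRDD;

public class CoalesceBeforePublish {
    // Collapse to roughly one partition per worker before the foreachPartition
    // that opens the SNS/SQS clients, so only a handful of client pairs get
    // created instead of one per original partition.
    public static JavaPairRDD<String, String> collapse(JavaPairRDD<String, String> aggregated,
                                                       int numberOfNodes) {
        // shuffle = false: partitions already on the same node are merged locally,
        // so no shuffle of the aggregated data is needed
        return aggregated.coalesce(numberOfNodes, false);
    }
}
```

Then run the same foreachPartition over collapse(aggregated, 9), with 9 being the number of worker nodes in your cluster, so each node builds its clients once and processes all of its local data.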

Bradley Kaiser
  • Yes, coalesce is exactly my solution. One more point I want to add here: I had many small files, like 23 KB, 45 KB, etc., and with the coalesce call they got shrunk to the right number of partitions, and now I am able to process close to 25 GB in 20 mins. Improving it further. – Sam Feb 05 '17 at 08:18
  • Thanks Bradley. One more thing: say I need to process 1 TB of data, how many partitions should I create with coalesce? – Sam Feb 05 '17 at 08:20
  • I would use a number of partitions large enough that each one fits into memory, or the number of cores I have, whichever is greater. – Bradley Kaiser Feb 07 '17 at 15:04
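
As a rough illustration of the sizing rule in the last comment above, using the cluster from the question (9 executors x 15 cores) and an assumed target of about 256 MB per partition; neither number comes from the original posts, so adjust both to your own setup:

```java
public class PartitionCountSketch {
    public static void main(String[] args) {
        long inputBytes = 1024L * 1024 * 1024 * 1024;     // 1 TB of input
        long targetPartitionBytes = 256L * 1024 * 1024;   // assumed ~256 MB per partition
        long totalCores = 9L * 15;                        // executors x cores per executor = 135

        // partitions needed so each one stays around the memory target
        long byMemory = (inputBytes + targetPartitionBytes - 1) / targetPartitionBytes; // 4096

        // "whichever is greater": memory-sized count vs. available cores
        long numPartitions = Math.max(byMemory, totalCores);

        System.out.println(numPartitions);                // prints 4096
    }
}
```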