
I have a performance problem with code I'm reviewing: it throws an OOM every time it performs a count. I think I found the cause: after a keyBy transformation, an aggregateByKey is executed. The problem is that almost 98% of the RDD elements share the same key, so the aggregateByKey shuffle puts nearly all records into the same partition. Bottom line: only a few executors do any work, and they are under heavy memory pressure.

This is the code:

val rddAnomaliesByProcess : RDD[AnomalyPO] = rddAnomalies
    // key every anomaly by its process creator's name
    .keyBy(po => po.getProcessCreator.name)
    // collect all values of a key into one List; with a hot key this
    // pulls nearly the whole dataset into a single partition
    .aggregateByKey(List[AnomalyPO]())((list, value) => value +: list, _ ++ _)
    .map { case (name, list) =>
      // within each process, group duplicates by their "business" key...
      val groupByKeys = list.groupBy(po => (po.getPodId, po.getAnomalyCode, po.getAnomalyReason, po.getAnomalyDate, po.getMeasureUUID))
      // ...and keep only the entry with the latest process date per group
      val lastOfGroupByKeys = groupByKeys.map { po => (po._1, List(po._2.sortBy { po => po.getProcessDate.getMillis }.last)) }
      lastOfGroupByKeys.flatMap(f => f._2)
    }
    .flatMap(f => f)

log.info("not duplicated Anomalies: " + rddAnomaliesByProcess.count)

I would like a way to perform the operation in a more parallel fashion, so that all executors work roughly equally. How can I do that?

Should I use a custom partitioner?

Giorgio
  • *"The problem lies to the fact that almost 98% of the RDD elements has the same key"* Is there a reason so many elements have the same key? Is that a business requirement? – Yuval Itzchakov Jan 16 '17 at 15:56
  • 1
    Actually I don't know, I don't have any functional knowledge, I'm just trying to find performance bottleneck. I must consider that they had thought about it and partitioning is right. – Giorgio Jan 16 '17 at 16:00
  • Perhaps if the key generation was better, and ideally uniform, you wouldn't have a problem where a single partition was so large. – Yuval Itzchakov Jan 16 '17 at 16:00
  • But as you can see, they do also an aggregateByKey, so I should trying to solve without changing code, other than partioning transformations. – Giorgio Jan 16 '17 at 16:03
  • Yes , is a very bad piece of code, no doubt about it. I would like Just ti know if there was any quick and dirty solution applicable without changing program Logic and I was thinking if partitioning could solve or at least alleviate performance problems – Giorgio Jan 16 '17 at 16:21

1 Answer


If your observation is correct and

98% of the RDD elements share the same key

then a change of partitioner won't help you at all. A partitioner, by definition, maps each key to exactly one partition, so 98% of the data will have to be processed by a single executor no matter how the keys are partitioned.
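
To see why, consider a minimal custom partitioner sketch: getPartition is a pure function of the key, so whatever logic you put in it, every record carrying the hot key lands in the same partition:

    import org.apache.spark.Partitioner

    // A partitioner must derive the partition from the key alone, so all
    // records sharing one key are inevitably co-located, custom or not.
    class NamePartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int =
        math.abs(key.hashCode % numPartitions)  // same key => same partition, always
    }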

Luckily, bad code is probably a bigger problem here than the skew. Skipping over:

.aggregateByKey(List[AnomalyPO]())((list, value) => value +: list, _ ++ _)

which is just folk magic, it looks like the whole pipeline can be rewritten as a simple reduceByKey. Pseudocode (a complete end-to-end sketch follows the two steps below):

  • Combine name and local keys into a single key:

    def key(po: AnomalyPO) = (
      // "major" key
      po.getProcessCreator.name, 
      // "minor" key
      po.getPodId, po.getAnomalyCode,
      po.getAnomalyReason, po.getAnomalyDate, po.getMeasureUUID
    )
    

    A key containing the name, the date and the additional fields should have a much higher cardinality than the name alone.

  • Map to pairs and reduce by key:

    rddAnomalies
      .map(po => (key(po), po))
      // keep, for every composite key, the record with the latest process date
      .reduceByKey((x, y) =>
        if (x.getProcessDate.getMillis > y.getProcessDate.getMillis) x else y
      )
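
Putting it together, the whole job could be written as follows (a sketch assembled from the two steps above; .values drops the composite key so the result is the same RDD[AnomalyPO] the original code produced):

    val rddAnomaliesByProcess: RDD[AnomalyPO] = rddAnomalies
      .map(po => (key(po), po))
      .reduceByKey((x, y) =>
        if (x.getProcessDate.getMillis > y.getProcessDate.getMillis) x else y
      )
      .values  // drop the composite key, keeping one latest AnomalyPO per group

    log.info("not duplicated Anomalies: " + rddAnomaliesByProcess.count)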
    
zero323