
I have a Spark standalone cluster with a master and two executors. I have an RDD[LevelOneOutput]; the LevelOneOutput class is below:

import scala.beans.BeanProperty
import scala.collection.mutable.ArrayBuffer

class LevelOneOutput extends Serializable {

  @BeanProperty
  var userId: String = _

  @BeanProperty
  var tenantId: String = _

  @BeanProperty
  var rowCreatedMonth: Int = _

  @BeanProperty
  var rowCreatedYear: Int = _

  @BeanProperty
  var listType1: ArrayBuffer[TypeOne] = _

  @BeanProperty
  var listType2: ArrayBuffer[TypeTwo] = _

  @BeanProperty
  var listType3: ArrayBuffer[TypeThree] = _

  ...
  ...

  @BeanProperty
  var listType18: ArrayBuffer[TypeEighteen] = _

  @BeanProperty
  var groupbyKey: String = _
}

Now I want to group this RDD based on userId, tenantId, rowCreatedMonth, and rowCreatedYear. For that I did this:

val levelOneRDD = inputRDD.map(row => {
  row.setGroupbyKey(s"${row.getTenantId}_${row.getRowCreatedYear}_${row.getRowCreatedMonth}_${row.getUserId}")
  row
})

val groupedRDD = levelOneRDD.groupBy(row => row.getGroupbyKey)

This gives me the data with the key as a String and the value as an Iterable[LevelOneOutput].

Now I want to generate a single LevelOneOutput object for each group key. For that I was doing something like below:

val rdd = groupedRDD.map(row => {
  val levelOneOutput = new LevelOneOutput
  val groupKey = row._1.split("_")

  levelOneOutput.setTenantId(groupKey(0))
  levelOneOutput.setRowCreatedYear(groupKey(1).toInt)
  levelOneOutput.setRowCreatedMonth(groupKey(2).toInt)
  levelOneOutput.setUserId(groupKey(3))

  var listType1 = new ArrayBuffer[TypeOne]
  var listType2 = new ArrayBuffer[TypeTwo]
  var listType3 = new ArrayBuffer[TypeThree]
  ...
  ...
  var listType18 = new ArrayBuffer[TypeEighteen]

  row._2.foreach(data => {
    if (data.getListType1 != null) listType1 = listType1 ++ data.getListType1
    if (data.getListType2 != null) listType2 = listType2 ++ data.getListType2
    if (data.getListType3 != null) listType3 = listType3 ++ data.getListType3
    ...
    ...
    if (data.getListType18 != null) listType18 = listType18 ++ data.getListType18
  })

  if (listType1.isEmpty) levelOneOutput.setListType1(null) else levelOneOutput.setListType1(listType1)
  if (listType2.isEmpty) levelOneOutput.setListType2(null) else levelOneOutput.setListType2(listType2)
  if (listType3.isEmpty) levelOneOutput.setListType3(null) else levelOneOutput.setListType3(listType3)
  ...
  ...
  if (listType18.isEmpty) levelOneOutput.setListType18(null) else levelOneOutput.setListType18(listType18)

  levelOneOutput
})

This is working as expected for small input sizes, but when I try to run it on a larger input data set, the group by operation hangs at 199/200 and I don't see any specific error or warning in stdout/stderr.
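A task stuck at 199/200 in a shuffle stage often points to a skewed key, i.e. one group is far larger than the rest, so a single task receives most of the data. A quick way to check whether that is happening here (my own diagnostic sketch, reusing the same composite key as above) would be:

val keyCounts = inputRDD
  .map(row => (s"${row.getTenantId}_${row.getRowCreatedYear}_${row.getRowCreatedMonth}_${row.getUserId}", 1L))
  .reduceByKey(_ + _)

// Print the ten largest groups; a handful of huge keys would explain the straggler task
keyCounts.top(10)(Ordering.by(_._2)).foreach(println)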

Can someone point out why the job is not proceeding further?


1 Answer


Instead of using the groupBy operation, I created a paired RDD like below

val levelOnePairedRDD = inputRDD.map(row => {
  row.setGroupbyKey(s"${row.getTenantId}_${row.getRowCreatedYear}_${row.getRowCreatedMonth}_${row.getUserId}")
  (row.getGroupbyKey, row)
})

and updated the processing logic, which solved my issue.
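The answer does not show the updated processing logic, but here is a sketch of one way it could look (my assumption, not the original code): reduce the pairs with a pairwise merge instead of materializing each whole group. Unlike groupBy, reduceByKey combines values map-side before the shuffle, so no single task has to build an enormous Iterable for a hot key.

import scala.collection.mutable.ArrayBuffer

// Null-safe concatenation of one list field (repeat for the others)
def concat[T](a: ArrayBuffer[T], b: ArrayBuffer[T]): ArrayBuffer[T] =
  if (a == null) b else if (b == null) a else a ++ b

// Merge two rows that share the same group key; the sketch mutates and
// returns the first argument to stay short, but building a fresh
// LevelOneOutput instead would work just as well
def merge(a: LevelOneOutput, b: LevelOneOutput): LevelOneOutput = {
  a.setListType1(concat(a.getListType1, b.getListType1))
  a.setListType2(concat(a.getListType2, b.getListType2))
  // ... same for listType3 through listType18 ...
  a
}

val aggregatedRDD = levelOnePairedRDD.reduceByKey(merge _).values

Because the identifying fields (userId, tenantId, rowCreatedMonth, rowCreatedYear) are already set on every row, the merged row keeps them automatically, so there is no need to split the key string back apart as in the original code.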
