
We have a massive legacy SQL table that we need to extract data out of and push to S3. Below is how I'm querying a portion of the data at a time and writing the output.

  def writeTableInParts(tableName: String,
                        numIdsPerParquet: Long,
                        numPartitionsAtATime: Int,
                        startFrom: Long = -1,
                        endTo: Long = -1,
                        filePrefix: String = s3Prefix): Unit = {
    val minId: Long = if (startFrom > 0) startFrom else findMinCol(tableName, "id")
    val maxId: Long = if (endTo > 0) endTo else findMaxCol(tableName, "id")

    // Split [minId, maxId] into windows of numIdsPerParquet ids and write
    // numPartitionsAtATime windows per parquet directory. The bound is maxId + 1
    // so the row with id == maxId isn't dropped by the `id < end` filter.
    (minId until maxId + 1 by numIdsPerParquet).grouped(numPartitionsAtATime).foreach { group =>
      group.map { start =>
        val end = math.min(start + numIdsPerParquet, maxId + 1)

        // Push the id-range predicate down to MySQL so only this window is scanned.
        sqlContext.read.jdbc(
          mysqlConStr,
          s"(SELECT * FROM $tableName WHERE id >= $start AND id < $end) as tmpTable",
          Map[String, String]())
      }
      .reduce((left, right) => left.unionAll(right))
      .write
      .parquet(s"$filePrefix/$tableName/${group.head}-${group.last + numIdsPerParquet}")
    }
  }
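
For context, I call it roughly like this (the table name and sizes here are just illustrative):

  // Example call: export `orders` in windows of 500,000 ids,
  // unioning 4 windows into each parquet output directory.
  writeTableInParts("orders", numIdsPerParquet = 500000L, numPartitionsAtATime = 4)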

This has worked well for many different tables, but for whatever reason one table keeps failing with java.nio.channels.ClosedChannelException no matter how much I reduce the scanning window or the size.

Based on this answer I'm guessing there is an exception somewhere in my code, but I'm not sure where it would be, since the code is rather simple. How can I debug this exception further? The logs didn't have anything helpful and don't reveal the cause.
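
One idea I'm considering, sketched below, is to read a single small id window outside the union/write loop and materialize it without writing anything, so that whatever fails inside the stage surfaces with a full stack trace on the driver. The id range is just a placeholder; mysqlConStr and tableName are the same values used above:

  // Probe one narrow window and materialize it, so the root-cause
  // exception (if any) shows up directly instead of being buried in a stage failure.
  val probe = sqlContext.read.jdbc(
    mysqlConStr,
    s"(SELECT * FROM $tableName WHERE id >= 0 AND id < 1000) as tmpTable",  // placeholder range
    Map[String, String]())

  try {
    probe.collect()  // small window, so collecting to the driver is fine here
  } catch {
    // Check the "Caused by" entries in the printed chain for the real JDBC error.
    case e: Exception => e.printStackTrace()
  }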


1 Answer


The problem was due to the error below and wasn't Spark-related... It was very cumbersome to chase down, as Spark isn't very good at surfacing the underlying error. Darn...

'0000-00-00 00:00:00' can not be represented as java.sql.Timestamp error
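
In case it helps someone else: assuming the source is MySQL accessed through Connector/J (which matches the mysqlConStr in the question), one way to work around the zero dates is to have the driver return them as NULL via the zeroDateTimeBehavior connection option, roughly like this:

  // Assumption: mysqlConStr is the JDBC URL from the question. Appending
  // zeroDateTimeBehavior=convertToNull tells Connector/J to return NULL
  // instead of throwing on '0000-00-00 00:00:00' values.
  // Use '&' instead of '?' if the URL already has query parameters.
  val mysqlConStrSafe = mysqlConStr + "?zeroDateTimeBehavior=convertToNull"

Alternatively, the zero timestamps could be cleaned up or filtered out on the MySQL side before the export.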
