
I have a large (> 500m row) CSV file. Each row in this CSV file contains a path to a binary file located on HDFS. I want to use Spark to read each of these files, process them, and write out the results to another CSV file or a table.

Doing this from the driver is simple enough, and the following code gets the job done:

val hdfsFilePathList = // read paths from CSV, collect into list

hdfsFilePathList.map( pathToHdfsFile => {
  // binaryFiles is called once per path, and each call goes through the driver's SparkContext
  sqlContext.sparkContext.binaryFiles(pathToHdfsFile).mapPartitions {
    functionToProcessBinaryFiles(_)
  }
})

The major problem with this is that the driver is doing too much of the work. I would like to farm out the work done by binaryFiles to the executors. I've found some promising examples that I thought would allow me to access the sparkContext from an executor:

Use SparkContext hadoop configuration within RDD methods/closures, like foreachPartition

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala

but they don't seem to work the way I thought they would. I'd expect the following to work:

import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Serializable wrapper around a Hadoop Configuration (Configuration itself is not Serializable)
class ConfigSerDeser(var conf: Configuration) extends Serializable {

  def this() {
    this(new Configuration())
  }

  def get(): Configuration = conf

  private def writeObject(out: ObjectOutputStream): Unit = {
    conf.write(out)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    conf = new Configuration()
    conf.readFields(in)
  }

  private def readObjectNoData(): Unit = {
    conf = new Configuration()
  }
}

val serConf = new ConfigSerDeser(sc.hadoopConfiguration)

val mappedIn = inputDf.map( row => {
    serConf.get()
})

But it fails with KryoException: java.util.ConcurrentModificationException
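One thing I have not tried yet is to broadcast the wrapper instead of capturing it in the closure, so that each executor pulls its own copy of the configuration out of the broadcast. This is only a sketch (I don't know whether it actually sidesteps the Kryo issue, and it assumes the first column of inputDf holds the HDFS path):

val confBroadcast = sc.broadcast(new ConfigSerDeser(sc.hadoopConfiguration))

val mappedIn = inputDf.rdd.map { row =>
  // each task reads the configuration from the broadcast rather than from the closure
  val conf = confBroadcast.value.get()
  // ... open a FileSystem with `conf` and read the file at row.getString(0) here ...
  row.getString(0)
}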

Is it possible to have the executors access HDFS files, or the HDFS file system, directly? Or alternatively, is there an efficient way to read millions of binary files on HDFS/S3 and process them with Spark?

  • Try rdd.foreachPartitionAsync(), where rdd is the input RDD that holds the path details. (I am not sure whether it will solve your problem.) – Bishnu Apr 01 '19 at 19:17

1 Answer


I had a similar use case where I was trying to do the same thing, but realised that SparkSession and SparkContext are not serializable and therefore cannot be accessed from executors.
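A common workaround is to skip SparkContext on the executors entirely and read the files with the plain Hadoop FileSystem API inside mapPartitions. Something like the sketch below (it assumes core-site.xml/hdfs-site.xml are visible on the executor classpath, that the first column of inputDf holds the path, and processBytes is just a placeholder for your own processing):

import java.io.ByteArrayOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

// placeholder for the real processing logic
def processBytes(path: String, bytes: Array[Byte]): String =
  s"$path,${bytes.length}"

val pathsRdd = inputDf.rdd.map(_.getString(0))   // assumes column 0 is the HDFS path

val results = pathsRdd.mapPartitions { paths =>
  // Executors can talk to HDFS directly through the Hadoop API;
  // they just cannot use SparkContext or SparkSession.
  val conf = new Configuration()   // picks up the cluster's Hadoop config from the classpath
  paths.map { pathStr =>
    val path = new Path(pathStr)
    val fs = path.getFileSystem(conf)   // FileSystem instances are cached, so this is cheap
    val in = fs.open(path)
    val out = new ByteArrayOutputStream()
    try {
      IOUtils.copyBytes(in, out, 4096, false)
    } finally {
      in.close()
    }
    processBytes(pathStr, out.toByteArray)
  }
}

// results is an ordinary RDD[String]; write it out with e.g. results.saveAsTextFile(...)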

Bishnu