I have a large (> 500M rows) CSV file. Each row in this CSV file contains a path to a binary file located on HDFS. I want to use Spark to read each of those files, process them, and write out the results to another CSV file or a table.
Doing this is simple enough in the driver, and the following code gets the job done:
val hdfsFilePathList = // read paths from CSV, collect into list
hdfsFilePathList.map( pathToHdfsFile => {
  sqlContext.sparkContext.binaryFiles(pathToHdfsFile).mapPartitions {
    functionToProcessBinaryFiles(_)
  }
})
The major problem with this is that the driver does too much of the work. I would like to farm the work done by binaryFiles out
to the executors. I've found some promising examples that I thought would allow me to access the sparkContext from an executor:
Use SparkContext hadoop configuration within RDD methods/closures, like foreachPartition
but they don't seem to work the way I thought they would. I'd expect the following to work:
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration

class ConfigSerDeser(var conf: Configuration) extends Serializable {

  def this() {
    this(new Configuration())
  }

  def get(): Configuration = conf

  private def writeObject(out: ObjectOutputStream): Unit = {
    conf.write(out)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    conf = new Configuration()
    conf.readFields(in)
  }

  private def readObjectNoData(): Unit = {
    conf = new Configuration()
  }
}
val serConf = new ConfigSerDeser(sc.hadoopConfiguration)

val mappedIn = inputDf.map( row => {
  serConf.get()
})
But this fails with KryoException: java.util.ConcurrentModificationException.
Is it possible to have the executors access HDFS files or the HDFS file system directly? Or alternatively, is there an efficient way to read millions of binary files on HDFS/S3 and process them with Spark?
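To make that concrete, the rough sketch below is the kind of thing I have in mind: each partition opens the Hadoop FileSystem on the executor itself, so the driver never touches the file contents. This is only a sketch, not code I have working; processBytes is a placeholder for my real per-file processing, and I'm assuming the path sits in the first (string) column of inputDf.

import java.io.ByteArrayOutputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Sketch only: read each file's bytes on the executors, not the driver.
// Assumes inputDf has a single string column holding the HDFS path and
// that processBytes(path, bytes) stands in for the real per-file logic.
val processed = inputDf.rdd.mapPartitions { rows =>
  // Built on the executor, so nothing from the driver's Configuration has to
  // be serialized; the cluster's site configs on the executor classpath apply.
  val conf = new Configuration()
  rows.map { row =>
    val path = new Path(row.getString(0))
    val fs = path.getFileSystem(conf)   // resolves the scheme (hdfs://, s3a://) per path
    val in = fs.open(path)
    val buffer = new ByteArrayOutputStream()
    try {
      IOUtils.copyBytes(in, buffer, conf, false)
    } finally {
      in.close()
    }
    processBytes(path.toString, buffer.toByteArray)
  }
}

If something along these lines is sound, I could then write processed out to CSV or a table; if it isn't, I'd like to understand what the recommended pattern is.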