I want to read some data from Hadoop directly on the Spark workers.
In my Spark program I have a Hadoop configuration on the driver:
val configuration = session.sparkContext.hadoopConfiguration
But I can't use it on the workers, because it isn't Serializable:
session.sparkContext.parallelize(paths).mapPartitions(paths => {
  for (path <- paths) yield {
    // for example, read the parquet footer
    val footer = ParquetFileReader.readFooter(configuration, new Path(path), ParquetMetadataConverter.NO_FILTER)
    footer.getFileMetaData.getSchema.getName
  }
})
results in
object not serializable (class: org.apache.hadoop.conf.Configuration...
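
For reference, a workaround I'm considering (a minimal, untested sketch; it relies on Configuration implementing Writable and on Spark's SerializableWritable developer API) is to broadcast the wrapped configuration and unwrap it on the workers:

import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.spark.SerializableWritable

// Configuration implements Writable, so it can be wrapped and broadcast
val confBroadcast = session.sparkContext.broadcast(
  new SerializableWritable(session.sparkContext.hadoopConfiguration))

session.sparkContext.parallelize(paths).mapPartitions { paths =>
  // the outer .value reads the broadcast, the inner .value
  // returns the plain Hadoop Configuration
  val conf = confBroadcast.value.value
  for (path <- paths) yield {
    val footer = ParquetFileReader.readFooter(conf, new Path(path), ParquetMetadataConverter.NO_FILTER)
    footer.getFileMetaData.getSchema.getName
  }
}

An alternative would be to build a fresh new Configuration() inside mapPartitions, but that would drop any settings applied on the driver. Is wrapping and broadcasting the configuration like this the right approach?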