I am designing a Spark job in order to:
- Parse a binary file that comes inside a .tar.gz file
- Create a DataFrame with POJOs extracted from the byte array
- Store them in Parquet
For the parsing of the binary file, I am using some legacy Java code that reads fixed-length fields from the byte array. This code works when I run it as a regular JVM process on my laptop.
However, when I upload the same file to HDFS and try to read it from Spark, the fixed-length reading of the fields fails: I never get the fields that the Java code expects.
Standalone code used successfully:
// This is a local path on my laptop
val is = new GZIPInputStream(new FileInputStream(basepath + fileName))
val reader = new E4GTraceFileReader(is, fileName)
// Here I invoke the legacy Java code
// The result here is correct
val result = reader.readTraces()
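For context, the legacy Java reader consumes fixed-length fields from the byte array; a minimal sketch of that parsing style (hypothetical field widths and names, not the real E4GTraceFileReader layout):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FixedLengthDemo {
    // Hypothetical record header: 4-byte id, 8-byte timestamp, 2-byte payload length (big-endian)
    static long[] readHeader(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
        int id = buf.getInt();      // bytes 0..3
        long ts = buf.getLong();    // bytes 4..11
        short len = buf.getShort(); // bytes 12..13
        return new long[] { id, ts, len };
    }

    public static void main(String[] args) {
        byte[] record = ByteBuffer.allocate(14)
                .putInt(42).putLong(1000L).putShort((short) 7).array();
        long[] fields = readHeader(record);
        System.out.println(fields[0] + " " + fields[1] + " " + fields[2]);
    }
}
```

Parsing like this assumes the stream delivers the exact decoded bytes of the file: any offset shift or still-encoded input moves every field boundary.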
Spark Job:
val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
val hdfsFiles = spark.sparkContext.parallelize(hdfs.listStatus(new Path("SOME_PATH")).map(_.getPath))
// Create an InputStream from each file in the folder
val inputStreamsRDD = hdfsFiles.map(x => {
  val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  (hdfs.open(x).getWrappedStream, x)
})

// Read each InputStream into a byte[]
val tracesRDD = inputStreamsRDD.flatMap(x => readTraceRecord(x._1, x._2)).map(flattenPOJO)

private def readTraceRecord(is: InputStream, fileName: Path): List[E4GEventPacket] = {
  println(s"Starting to read ${fileName.getName}")
  val reader = new E4GTraceFileReader(is, fileName.getName)
  reader.readTraces().asScala.toList
}
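One detail worth flagging: the standalone path wraps the file in a GZIPInputStream, while the Spark path hands the raw hdfs.open stream straight to the reader. A self-contained round trip (plain java.util.zip, no HDFS) showing that raw gzip bytes only match the original content after decompression:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gz.read(buf)) != -1) bos.write(buf, 0, n);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "fixed-length record bytes".getBytes("UTF-8");
        byte[] compressed = gzip(original);
        // The compressed bytes are not what a fixed-length parser expects...
        System.out.println(Arrays.equals(original, compressed));           // false
        // ...the original bytes only come back after decompression
        System.out.println(Arrays.equals(original, gunzip(compressed)));   // true
    }
}
```

Whether this difference matters in my case depends on what hdfs.open actually returns for the uploaded file, which is part of what I am unsure about.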
I have tried both the FSDataInputStream returned by hdfs.open and hdfs.open(x).getWrappedStream, but I don't get the expected result.
I'm not sure whether I should paste the legacy Java code here, as it is a bit lengthy, but it clearly fails to extract the expected fields.
Do you think the problem is the serialization Spark performs when shipping data from the driver program to the executors, somehow corrupting the data?
I have tried both YARN and local[1], but I get the same result.