
I am designing a Spark job in order to:

  • Parse a binary file that comes inside a .tar.gz file
  • Create a DataFrame from the POJOs extracted from the byte array
  • Store them in Parquet (a sketch of these last two steps follows the list)
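
A minimal sketch of those last two steps, assuming E4GEventPacket is a Java bean and spark is an existing SparkSession (the output path is just a placeholder):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def writeTraces(spark: SparkSession, tracesRDD: RDD[E4GEventPacket]): Unit = {
  // createDataFrame can derive the schema from a Java bean class
  val df = spark.createDataFrame(tracesRDD, classOf[E4GEventPacket])
  // Store the result as Parquet
  df.write.parquet("hdfs://HDFS_IP_PORT/SOME_OUTPUT_PATH")
}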

To parse the binary file, I am using some legacy Java code that reads fixed-length fields from the byte array. This code works when I run it as a regular JVM process on my laptop.
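
To give an idea of what that reader does, here is an illustrative sketch of the fixed-length style (the field names and lengths below are made up, not the real E4G format):

import java.io.{DataInputStream, InputStream}

def readFixedLengthHeader(is: InputStream): (String, Long, Int) = {
  val in = new DataInputStream(is)
  val recordType = new Array[Byte](4)
  in.readFully(recordType)                    // 4-byte record type
  val timestamp = in.readLong()               // 8-byte timestamp
  val payloadLength = in.readUnsignedShort()  // 2-byte payload length
  (new String(recordType, "US-ASCII"), timestamp, payloadLength)
}

A reader like this consumes the bytes exactly as they arrive on the stream it is handed.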

However, when I upload the same file to HDFS and try to read it from Spark, the fixed-length reading fails: I never get the fields that the Java code expects.

Standalone code used successfully:

import java.io.FileInputStream
import java.util.zip.GZIPInputStream

// This is a local path on my laptop
val is = new GZIPInputStream(new FileInputStream(basepath + fileName))
val reader = new E4GTraceFileReader(is, fileName)

// Here I invoke the legacy Java code
// The result here is correct
val result = reader.readTraces()

Spark Job:

val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())

val hdfsFiles = spark.sparkContext.parallelize(hdfs.listStatus(new Path("SOME_PATH")).map(_.getPath))

// Create Input Stream from each file in the folder
val inputStreamsRDD = hdfsFiles.map { x =>
  val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  (hdfs.open(x).getWrappedStream, x)
}

// Read the InputStream into a byte[]
val tracesRDD = inputStreamsRDD.flatMap(x => readTraceRecord(x._1,x._2)).map(flattenPOJO)

private def readTraceRecord(is: InputStream, fileName: Path): List[E4GEventPacket] = {
  println(s"Starting to read ${fileName.getName}")
  val reader = new E4GTraceFileReader(is, fileName.getName)
  reader.readTraces().asScala.toList
}
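
For reference, Spark's built-in binaryFiles would be an equivalent way to get one stream per file on the executors; a sketch for comparison (it hands back the raw file bytes as a PortableDataStream, with no decompression applied):

val portableRDD = spark.sparkContext.binaryFiles("hdfs://HDFS_IP_PORT/SOME_PATH")

val tracesFromPortableRDD = portableRDD.flatMap { case (pathString, pds) =>
  // pds.open() returns a DataInputStream over the raw file content
  readTraceRecord(pds.open(), new Path(pathString))
}.map(flattenPOJO)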

I have tried both the FSDataInputStream returned by hdfs.open and hdfs.open(x).getWrappedStream, but I don't get the expected result. I don't know if I should paste the legacy Java code here, as it is a bit lengthy, but it clearly fails to get the expected fields.

Do you think the problem is the serialization that Spark performs between the driver program and the executors, which somehow corrupts the data?

I have tried both YARN and local[1], but I get the same results.

Victor
  • Spark transformations (map, flatMap) are executed on the executor side, so the data is not moved between the driver and them; normally it shouldn't be a serialization issue. Could you check that by reading the data from HDFS locally: val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration()); val inputStream = hdfs.open(x).getWrappedStream; readTraceRecord(inputStream)? Maybe something goes wrong there? – Bartosz Konieczny Jul 18 '18 at 11:12
  • The binary InputStream goes into a (text?) E4GTraceFileReader, so it could be the encoding used to convert the binary bytes to "text". At the moment the default encoding is used, which works on Windows because it is a single-byte encoding. *Of course, this is a wild, unfounded guess.* – Joop Eggen Jul 18 '18 at 11:15
  • Thanks for the comments! I think I will follow the line of research you propose and try to run it in Windows (this question on SO for that is great, btw: https://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows) – Victor Jul 18 '18 at 18:04
  • Found the problem... Within the Spark code, I was not creating a GZIPInputStream as a wrapper over the underlying InputStream ;( – Victor Jul 18 '18 at 20:44
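
For completeness, the change implied by that last comment, as a minimal sketch (assuming the files are plain gzip streams, which is also what the working local snippet assumes):

import java.util.zip.GZIPInputStream

val inputStreamsRDD = hdfsFiles.map { x =>
  val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  // Wrap the raw HDFS stream in a GZIPInputStream, exactly as in the local version,
  // so the legacy reader sees decompressed bytes instead of gzip-compressed ones
  (new GZIPInputStream(hdfs.open(x)), x)
}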

0 Answers