PySpark has a `sequenceFile` method on `SparkContext` that lets us read a sequence file stored in HDFS or at a local path available to all nodes.
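For reference, this is roughly how I read such a file today (the path here is just a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext(appName="read-seqfile")

# Read key/value pairs from a sequence file on an HDFS path;
# in my case the values end up being JSON strings.
rdd = sc.sequenceFile("hdfs:///some/path/part-00000")
print(rdd.take(1))
```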
However, what if I already have a bytes object in driver memory that contains a serialized sequence file, and I need to deserialize it without first writing it out as a file?
For example, the application I am working on (I cannot change the application logic) runs a Spark job that writes this file to a non-HDFS-compliant file system. I can then retrieve it as an in-memory Python bytes object, which appears to contain a serialized sequence file that I should be able to deserialize in memory.
Because this object is already in memory (for reasons I cannot control), the only way I currently have to deserialize it and actually see the output (which is a JSON file) is to write it to a local file, move that file into HDFS, and then read it with the `sequenceFile` method (since that method only works with a file on an HDFS path or a local path available on every node). This creates problems in the application workflow.
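In code, the workaround looks roughly like this (the paths and the `raw_bytes` variable are placeholders for illustration):

```python
import subprocess
from pyspark import SparkContext

sc = SparkContext(appName="seqfile-workaround")

def read_sequence_bytes_via_hdfs(raw_bytes):
    """Current workaround: round-trip the in-memory bytes through local disk
    and HDFS just so sequenceFile() can read them back."""
    local_path = "/tmp/blob.seq"
    hdfs_path = "/tmp/blob.seq"

    # 1. Write the in-memory bytes object out as a local file.
    with open(local_path, "wb") as f:
        f.write(raw_bytes)

    # 2. Copy the local file into HDFS.
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path])

    # 3. Read it back with sequenceFile, which only accepts paths.
    return sc.sequenceFile(hdfs_path).collect()
```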
What I need is a way to deserialize this in memory so that I can write it out as a JSON file, without having to write it locally and then put it into HDFS only to read it back in with Spark.
Is there any way in Python to take this bytes-like NullWritable object and deserialize it into a Python dictionary, or to put it back into Hadoop as something I could actually read?