
PySpark has a method, sequenceFile, that reads a sequence file stored in HDFS or on a local path available to all nodes.
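For reference, reading such a file from HDFS looks roughly like this (the path and Writable class names are just examples):

    rdd = sc.sequenceFile(
        "hdfs:///data/output.seq",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.apache.hadoop.io.Text",
    )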

However, what if I already have a bytes object in driver memory, written in the sequence file format, that I need to deserialize?

For example, the application I am working on (I cannot change the application logic) runs a Spark job that writes this file to a non-HDFS-compliant file system. I can then retrieve it as an in-memory Python bytes object, which appears to contain a serialized sequence file that I should be able to deserialize in memory.

Because this object is already in memory (for reasons I cannot control), the only way I currently have to deserialize it and actually see the output (which is a JSON file) is to write it to a local file, move that file into HDFS, and then read it with the sequenceFile method (since that method only works with a path on HDFS or a local path present on every node). This creates problems in the application workflow. The round trip looks roughly like the sketch below.
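(Paths here are hypothetical, and raw_bytes stands in for the in-memory object:)

    import subprocess

    # Write the in-memory bytes to a local file
    with open("/tmp/output.seq", "wb") as f:
        f.write(raw_bytes)

    # Copy it into HDFS so sequenceFile can see it
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", "/tmp/output.seq", "/tmp/output.seq"],
        check=True,
    )

    # Only now can Spark read it
    rdd = sc.sequenceFile("hdfs:///tmp/output.seq")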

What I need is to deserialize this in memory so that I can write it out as a JSON file, without writing it locally and putting it into HDFS only to read it back in with Spark.

Is there any way in Python to take this bytes-like NullWritable object and deserialize it into either a Python dictionary, or put it back into Hadoop as something that I could actually read?
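For what it's worth, the SequenceFile on-disk layout (header, metadata, sync markers, record framing) is documented in Hadoop's SequenceFile javadoc, so in principle the bytes can be parsed in pure Python. The sketch below is untested against your data and assumes an uncompressed, non-block-compressed, version-6 file; raw is a placeholder name for the in-memory bytes object:

    import io
    import struct

    def read_vint(buf):
        """Decode a Hadoop WritableUtils-style variable-length integer."""
        first = struct.unpack(">b", buf.read(1))[0]
        if first >= -112:
            return first
        length = (-119 - first) if first < -120 else (-111 - first)
        value = 0
        for _ in range(length - 1):
            value = (value << 8) | buf.read(1)[0]
        # Multi-byte negative vints are stored as the one's complement
        return ~value if first < -120 else value

    def read_text(buf):
        """Read a Hadoop Text-encoded string (vint length + UTF-8 bytes)."""
        n = read_vint(buf)
        return buf.read(n).decode("utf-8")

    def parse_sequence_file(raw):
        buf = io.BytesIO(raw)
        assert buf.read(3) == b"SEQ", "not a SequenceFile"
        version = buf.read(1)[0]            # metadata block assumes version 6
        key_class = read_text(buf)
        value_class = read_text(buf)
        compressed = buf.read(1)[0] != 0
        block_compressed = buf.read(1)[0] != 0
        assert not compressed and not block_compressed, \
            "this sketch handles uncompressed files only"
        # Metadata: a 4-byte big-endian count, then Text key/value pairs
        (meta_count,) = struct.unpack(">i", buf.read(4))
        for _ in range(meta_count):
            read_text(buf)
            read_text(buf)
        sync = buf.read(16)                 # 16-byte sync marker
        records = []
        while True:
            header = buf.read(4)
            if len(header) < 4:
                break
            (rec_len,) = struct.unpack(">i", header)
            if rec_len == -1:               # sync-marker escape
                assert buf.read(16) == sync, "corrupt sync marker"
                continue
            (key_len,) = struct.unpack(">i", buf.read(4))
            key = buf.read(key_len)              # empty for NullWritable keys
            value = buf.read(rec_len - key_len)  # raw serialized value
            records.append((key, value))
        return key_class, value_class, records

Each value here is still a raw serialized Writable: for Text values the payload is a vint length followed by UTF-8 bytes (i.e. read_text(io.BytesIO(value))), and for NullWritable keys the key bytes are simply empty.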


Liam385
  • Did you mean it's in driver memory or executor memory? Does it matter? Yes, it would be more efficient to transfer it directly in memory to a format that you can write out to HDFS with. But are you writing a system that is so time-critical that this step would seriously hamper your work? I know it doesn't feel good using imperfect tools, but is this the bottleneck you need to solve? – Matt Andruff Dec 09 '21 at 20:01

1 Answer


Basically you'd have to look into Spark's own sequence file code, apply the relevant pieces, and convert the result into an RDD so that you can then do Spark things with it, like writing it out to a file.

Here's a link to get you started, but it will need some digging.
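If the parsing sketch from the question works for your data, converting the records into an RDD and writing JSON could look something like this (again a sketch: it assumes Text values holding JSON strings, and reuses the hypothetical parse_sequence_file and read_text helpers from above):

    import io

    _, _, records = parse_sequence_file(raw_bytes)
    json_strings = [read_text(io.BytesIO(v)) for _, v in records]

    # spark.read.json accepts an RDD of JSON strings
    df = spark.read.json(sc.parallelize(json_strings))
    df.write.json("hdfs:///tmp/decoded_output")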

Matt Andruff