I have a processing chain that goes along these lines:
- Preprocess data in a few steps, which includes calling Perl, Bash and Python scripts from a single Bash script and connecting them via pipes
- Transform data in Python (the program I use sadly doesn't run on Python 3, so I think I'm forced to run 2.7)
- Postprocess data just like in the preprocessing step
One way this has worked before is
cat input | preprocess.sh | transform.py | postprocess.sh
And this works well with processing batches of input data.
However, I now find myself needing to implement this as a server functionality in Python - I need to be able to accept a single data item, run the pipeline and spit it back out quickly.
The central step I just call from within Python, so that's the easy part. Postprocessing is also relatively easy.
Here's the issue: the preprocessing code consists of 4 different scripts, each outputting data to the next one, and two of which need to load model files from disk to work. That loading is relatively slow and does horrible things to my execution time. I thus think I need to keep them in memory somehow, write to their stdins and read the output.
However, I find that for every single link in my chain, I can't write to stdin and read stdout without closing stdin, and that would render the method useless as I would then have to reopen the process and load the model again.
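To make the goal concrete, here is a minimal sketch of what I am after: every stage is started once, stays alive, and each item is written to the first stage and read back from the last. All names here (stand_in, start_chain, process_item, stop_chain) are hypothetical, and the stages are stand-in Python one-liners that flush after every line, not my real scripts.

```python
import subprocess
import sys


def stand_in(expr):
    """Hypothetical stand-in for one script: applies `expr` to each input
    line and flushes stdout after every line."""
    code = (
        "import sys\n"
        "for line in iter(sys.stdin.readline, ''):\n"
        "    line = line.rstrip('\\n')\n"
        "    sys.stdout.write(%s + '\\n')\n"
        "    sys.stdout.flush()\n" % expr
    )
    return [sys.executable, "-c", code]


def start_chain(commands):
    """Start every command once, wiring stdout of each stage to the
    stdin of the next one. Returns the list of Popen objects."""
    procs = []
    for cmd in commands:
        upstream = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen(
            cmd, stdin=upstream, stdout=subprocess.PIPE,
            universal_newlines=True, bufsize=1,
            close_fds=True)  # assumption: POSIX; keeps pipe ends out of siblings
        if procs:
            # The parent no longer needs its copy of the intermediate pipe.
            procs[-1].stdout.close()
        procs.append(proc)
    return procs


def process_item(procs, item):
    """Send one line into the first stage and read one line from the last."""
    procs[0].stdin.write(item + "\n")
    procs[0].stdin.flush()
    return procs[-1].stdout.readline().rstrip("\n")


def stop_chain(procs):
    """Close the head of the chain so EOF propagates, then wait for all stages."""
    procs[0].stdin.close()
    for proc in procs:
        proc.wait()
    procs[-1].stdout.close()
```

Usage would be something like procs = start_chain([stand_in("line.upper()"), stand_in("line[::-1]")]), then process_item(procs, "abc") for each incoming item, and stop_chain(procs) at shutdown. With stand-ins that flush per line this round trip completes; it is the same pattern with my real scripts that I cannot get to return.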
Do note that this is not a problem with my scripts, as for each link in the chain
cat input_data | preprocessing_script_i.sh
returns just what it should within Bash.
Here are the things I have tried up until now:
- simply write to stdin and flush it - waits indefinitely on readline
- process.communicate - kills the process and is thus out of the question.
- using master and slave pty handles - hangs on readline
- using a queue and a thread to read stdout while writing to stdin from the main thread
- messing around with bufsize in the call to subprocess
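For reference, the queue-and-thread attempt looked roughly like the sketch below. The names start_reader, ask and ECHO_CMD are hypothetical, and ECHO_CMD is a stand-in child that flushes after every line rather than one of my real scripts. The timeout is only there so that a missing reply shows up as None instead of blocking forever.

```python
import subprocess
import sys
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7

# Hypothetical stand-in child: upper-cases each line and flushes immediately.
ECHO_CMD = [sys.executable, "-c", (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)]


def start_reader(proc):
    """Read proc.stdout line by line in a daemon thread, pushing each
    line onto a queue that the main thread can poll."""
    lines = queue.Queue()

    def pump():
        for line in iter(proc.stdout.readline, ""):
            lines.put(line)

    thread = threading.Thread(target=pump)
    thread.daemon = True  # do not keep the interpreter alive on exit
    thread.start()
    return lines


def ask(proc, lines, item, timeout=5.0):
    """Write one item to the child's stdin and wait up to `timeout`
    seconds for one line of output. Returns None if nothing arrives."""
    proc.stdin.write(item + "\n")
    proc.stdin.flush()
    try:
        return lines.get(timeout=timeout).rstrip("\n")
    except queue.Empty:
        return None
```

The child is started with subprocess.Popen(ECHO_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE, universal_newlines=True, bufsize=1), then lines = start_reader(proc) and ask(proc, lines, "abc") for each item.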
Is there some way to do this from Python? Is this even possible at all? I'm starting to doubt it. Could reimplementing this pipeline in another language (without touching the elements, as that's not quite feasible for my use case) solve this for me?