
I have a processing chain that goes along these lines:

  1. Preprocess the data in a few steps, which involves calling out to Perl, Bash, and Python scripts from a single Bash script, connected via pipes
  2. Transform the data in Python (the program I use sadly doesn't run on Python 3, so I think I'm forced to run 2.7)
  3. Postprocess the data, just like in the preprocessing step

One way this has worked before is

cat input | preprocess.sh | transform.py | postprocess.sh

This works well for processing batches of input data.

However, I now find myself needing to implement this as server functionality in Python: I need to be able to accept a single data item, run it through the pipeline and spit the result back out quickly.

The central step I just call from within Python, so that's the easy part. Postprocessing is also relatively easy.

Here's the issue: the preprocessing code consists of 4 different scripts, each piping its output to the next, and two of them need to load model files from disk to work. That loading is relatively slow and does horrible things to my execution time. I therefore think I need to keep them in memory somehow, write to their stdins and read the output.

However, I find that for every single link in my chain, I can't write to stdin and read stdout without closing stdin, and that would render the method useless as I would then have to reopen the process and load the model again.

Do note that this is not a problem with my scripts, as for each link in the chain

cat input_data | preprocessing_script_i.sh

returns just what it should within Bash.

Here are the things I have tried up until now:

  • simply write to stdin and flush it (see the sketch after this list) - waits indefinitely on readline
  • process.communicate - kills the process and is thus out of the question.
  • using master and slave pty handles - hangs on readline
  • using a queue and a thread to read stdout while writing to stdin from the main thread
  • messing around with bufsize in the call to subprocess
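
For concreteness, the simplest of these attempts looks roughly like this (the script name is just a placeholder for one link of the chain):

import subprocess

# One long-lived link of the chain; the script name is a placeholder
proc = subprocess.Popen(
    ["./preprocessing_script_1.sh"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

proc.stdin.write(b"a single data item\n")
proc.stdin.flush()
line = proc.stdout.readline()  # hangs here, even though the data went in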

Is there some way to do this from Python? Is this even possible at all? I'm starting to doubt it. Could reimplementing this pipeline (without touching its elements, as that's not quite feasible for my use case) in another language solve this for me?

Dominik Stańczak
  • Could you use a named pipe (FIFO) instead of chaining stdout and stdin? – Hannu Dec 21 '17 at 10:33
  • @Hannu I may be able to, as soon as I figure out what that is - haven't heard of the `named pipe` term. I'll get back to you after some googling! – Dominik Stańczak Dec 21 '17 at 10:35
  • 1
  • `os.mkfifo("/tmp/mypipe")` would create you one. Or you can do it in shell with `mknod /tmp/mypipe p`. You can then treat this as a file for reading and writing purposes, and write from one process and read from another. – Hannu Dec 21 '17 at 10:37
  • @Hannu That looks promising! I'm currently testing this, thanks! – Dominik Stańczak Dec 21 '17 at 10:42

3 Answers


The easiest may be to mv files (within the same filesystem, because rename is atomic over file operations whereas cp is not) into an "input" directory. A shell script loops infinitely, waits for a new file, mvs it into a "working" directory, processes it, and mvs it into a "done" or "error" directory.

Nahuel Fouilleul
  • Let me make sure I'm understanding you correctly: do you mean some sort of loop in Bash that watches for files in the `input` directory (which I could create from python), then preprocesses any new ones and moves them to `output`? Or do you mean implementing this from Python somehow? Any Python implementation may have the same issue of having to open the subprocess multiple times, unless `stdin` and `stdout` can be somehow hot-swapped (to account for new files coming in). Can they? – Dominik Stańczak Dec 21 '17 at 10:27
  • Yes, I'm talking about the bash shell: instead of reading from a pipe, read from different files. The directories can be defined once by convention, and the files should also follow a naming convention that identifies the caller process, so that it can find the result in done. `error` is just an idea, but it seems it doesn't help here – Nahuel Fouilleul Dec 21 '17 at 10:52
  • However, Python should create the files in another directory first and move them into input, to avoid a race condition (roughly as sketched below) – Nahuel Fouilleul Dec 21 '17 at 10:58
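
A rough sketch of that hand-off from the Python side; the directory names are only an example, and all three directories are assumed to exist on the same filesystem:

import os
import time

# Example directory layout, agreed on by convention with the shell loop
STAGING_DIR = "/tmp/pipeline/staging"
INPUT_DIR = "/tmp/pipeline/input"
DONE_DIR = "/tmp/pipeline/done"

def submit(item_id, data):
    # Write the file somewhere else first...
    staging_path = os.path.join(STAGING_DIR, item_id)
    with open(staging_path, "w") as f:
        f.write(data)
    # ...then move it atomically into the watched input directory
    os.rename(staging_path, os.path.join(INPUT_DIR, item_id))

def wait_for_result(item_id, poll_interval=0.05):
    # The shell loop is expected to drop the processed file into "done"
    done_path = os.path.join(DONE_DIR, item_id)
    while not os.path.exists(done_path):
        time.sleep(poll_interval)
    with open(done_path) as f:
        return f.read()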

You might avoid the stdin/stdout-related problems with a FIFO:

os.mkfifo("/tmp/fifo")

You can then use this from Python as a regular file for reading and writing purposes from different processes, and you can even peek into the FIFO (Python: Check if named pipe has data) in your reader to check whether there is something to be read there.
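
A minimal sketch of the idea, reusing the example path above; the writer and the reader would of course be separate processes:

import errno
import os

FIFO_PATH = "/tmp/fifo"  # example path from above

# Create the FIFO once; it is fine if it already exists
try:
    os.mkfifo(FIFO_PATH)
except OSError as e:
    if e.errno != errno.EEXIST:
        raise

# Writer side: opening for writing blocks until some reader opens the FIFO
with open(FIFO_PATH, "w") as fifo:
    fifo.write("one data item\n")

# Reader side (in another process) is the mirror image:
#     with open(FIFO_PATH) as fifo:
#         line = fifo.readline()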

If nothing like this helps, another approach would be to replace the inter-process communication with a messaging platform. ZeroMQ (zmq) is easy enough to implement and does not need any server components, and you would then get rid of chaining inputs and outputs: you would just publish messages from one process and read them from another. Data still gets transmitted, but with a threaded reader you would not be stuck on blocking IO.
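
For example, a brokerless PUSH/PULL pair with pyzmq could look roughly like this; the address and port are arbitrary:

import zmq

context = zmq.Context()

# Sending side, e.g. the step that produces data
sender = context.socket(zmq.PUSH)
sender.bind("tcp://127.0.0.1:5555")  # arbitrary local endpoint
sender.send_string("one data item")

# Receiving side, normally running in another process:
#     receiver = context.socket(zmq.PULL)
#     receiver.connect("tcp://127.0.0.1:5555")
#     item = receiver.recv_string()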

Hannu

I'm sorry: the ideas proposed were great, and this is probably not going to help many people in the future, but this is how I solved the problem.

It turns out the Perl invocation in my pipeline takes a -b flag for printing in line-buffered mode. Once I plugged that into the `perl -b script.perl` part of the processing pipeline, things started moving smoothly, and a simple `process.stdin.write()` followed by a `.flush()` was enough to get the output.
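
For the record, the working setup is essentially the sketch from the question plus that flag (the wrapper function name is mine):

import subprocess

# Same long-lived process as before, now with line-buffered output
proc = subprocess.Popen(
    ["perl", "-b", "script.perl"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

def run_one(item):
    # item is a single line of bytes without the trailing newline
    proc.stdin.write(item + b"\n")
    proc.stdin.flush()
    return proc.stdout.readline()  # returns once the script flushes its line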

I will try to change the question tags and title to better fit the actual problem.

Dominik Stańczak