
I would like to chain two Linux commands together such that command 1 can continually pass/stream data to command 2. Command 1, in my case, generates a (line-by-line) database export, which I would like to pipe into command 2 (which will call a function to write each data export line to a new db).

E.g.

 command 1 | command 2 

This Python-based solution looks really nice, but it also blocks, and only reads the stdout line by line AFTER it's all available: How to pipe input to python line by line from linux program?

By way of background, I'm attempting to export a 10 TB Cassandra db, which I can do using dsbulk. My idea/preference here is to not build up a 10 TB export and then process it; I would like to process it "in flight".
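
Roughly what I have in mind for command 2 is a small Python consumer that handles each line as it arrives on stdin (write_to_new_db is a hypothetical stand-in for my persistence function):

import sys

def write_to_new_db(record):
    """Hypothetical helper: persist one exported row to the new db."""
    ...

# sys.stdin is iterable; each iteration blocks only until the next
# line arrives, so rows are handled as command 1 produces them.
for line in sys.stdin:
    write_to_new_db(line.rstrip("\n"))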

Any pointers appreciated, thanks.

Damo
  • The question you link to is using xargs, which makes no sense in this case. Just use `sys.stdin` read functions or plain `input()` to read data coming from that pipe. Python will raise EOFError on `input()` once the previous command of your pipe finishes (see the sketch after these comments). – Tronic Apr 04 '20 at 15:56
  • Thanks for your reply. With what you are describing, do you think that will solve my problem? When/how does the `input()` function you describe get invoked? OK, it's a Python script, so we have something like: `dsbulk params.... | python perist_each_record_viaRead`? The whole thing still blocks, right? – Damo Apr 04 '20 at 16:12
  • It depends on the reader. If the reading process attempts to read all of the data before it does any processing, then you are likely to run into issues. If the reader is well designed, it will not try to read all of the data before it does any processing. – William Pursell Apr 04 '20 at 16:25
  • An operation on a pipe might block, but it doesn't even make sense to ask if the pipe blocks. – William Pursell Apr 04 '20 at 16:27
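
For reference, a minimal sketch of the `input()` pattern described in the comments above (process() is a hypothetical stand-in for the per-line handler):

while True:
    try:
        line = input()      # blocks until the next line arrives on stdin
    except EOFError:        # raised once command 1 exits and closes the pipe
        break
    process(line)           # hypothetical per-line handler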

1 Answer


source.py

for i in range(1000):
    print("Test", i)    # one line per iteration, written to stdout

dest.py

import sys

for line in sys.stdin:                  # iterates as each line arrives on stdin
    print(line.swapcase().strip())      # swap case, then strip the trailing newline

Then try something like `python3 source.py | python3 dest.py`
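
With both scripts in the current directory, the output should start something like this (swapcase() inverts the capital T):

$ python3 source.py | python3 dest.py
tEST 0
tEST 1
tEST 2
...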

Tronic
  • So that is interesting. When I increase the outer range to 1000000000, it's very clear to see the "streaming" of data from command 1 to command 2, which I think answers my question. So pipes don't block by default? I'm still a bit confused. – Damo Apr 04 '20 at 16:44
  • The pipe itself buffers at most 64 KiB, but `xargs your_program` reads the entire input before even starting your program, because it has to pass all of the input as command-line parameters (`sys.argv`) to your program. – Tronic Apr 04 '20 at 16:56
  • So in the above example, does dest.py have to read the entire input (i.e. the output of source.py) BEFORE it can be invoked / start doing its job? – Damo Apr 05 '20 at 13:15
  • OK, so I see the blocking (I have updated source.py to sleep for one second every time it prints, and dest.py only receives the data after source.py is complete). All good learning, but I feel I am back at square one. I have a command that will produce ~10 TB of text; I need the second command to start receiving that ASAP, as stdin of course won't be able to hold 10 TB. – Damo Apr 05 '20 at 13:32
  • If you do processing inside that for loop of dest.py, only a little bit is buffered. If the processing in dest.py is slower, the print will block (slow down) running of source.py to match. If source is slower, the loop in dest.py will run slower (stdin read blocks until there is more input). – Tronic Apr 05 '20 at 13:58
  • I see no buffering. To recap: source.py does its print("Test", i), then sleeps for one second, then does its next print("Test", i), etc. I have set the range to 10, and the behaviour I can see is that dest.py prints everything at once, after 10 seconds have elapsed. – Damo Apr 05 '20 at 14:54
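
The all-at-once behaviour in that last comment is most likely Python's own output buffering rather than the pipe: when stdout is a pipe instead of a terminal, print() is block-buffered, so a handful of short lines sit in source.py's buffer until the process exits. A minimal variant of source.py with explicit flushing should make the streaming visible:

import time

for i in range(10):
    print("Test", i, flush=True)   # flush=True pushes each line through the pipe immediately
    time.sleep(1)

Alternatively, run the producer unbuffered with `python3 -u source.py | python3 dest.py`.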