
I have an issue regarding mrjob. I'm using a Hadoop cluster with 3 datanodes, one namenode and one jobtracker. Starting from a nifty sample application, I wrote something like the following:

first_script.py:

    for i in range(1, 2000000):
        print "My Line " + str(i)

This obviously writes a bunch of lines to stdout. The second script is the mrjob Mapper and Reducer. Calling it from a (GNU) Unix shell, I tried:

    python first_script.py | python second_script.py -r hadoop

This gets the job done, but it first uploads the input to HDFS completely. Only when everything is uploaded does it start the second job. So my question is: is it possible to force a stream? (Like sending EOF?) Or did I get the whole thing wrong?
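For context, second_script.py is an ordinary mrjob job, roughly along these lines (only a sketch; the class name and the line-counting mapper/reducer below are illustrative, the real logic does something more useful):

    # second_script.py -- minimal mrjob skeleton; logic is illustrative only
    from mrjob.job import MRJob

    class MRLineCount(MRJob):

        def mapper(self, _, line):
            # called once per input line after the job has started on the cluster
            yield "lines", 1

        def reducer(self, key, values):
            yield key, sum(values)

    if __name__ == '__main__':
        MRLineCount.run()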


2 Answers


Obviously you have long since forgotten about this, but I'll reply anyway: no, it's not possible to force a stream. The whole Hadoop programming model is about taking files as input and outputting files (and possibly creating side effects, e.g. uploading the same stuff to a database).

Jyrsa

It might help if you clarified what you want to achieve a little more. However, it sounds like you want the contents of a pipe to be processed periodically, rather than waiting until the stream is finished. The stream can't be forced.

The reader of the pipe (your second_script.py) needs to break its stdin into chunks, using either

  • a fixed number of lines, like this question and answer (sketched after this list), or
  • non-blocking reads and a preset idle period, or
  • a predetermined break sequence emitted from first_script.py, such as a 'blank' line consisting of only \0.
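For the first option, a small driver in front of second_script.py might look roughly like this (just a sketch: the chunk size, the temp-file handling and the script name are illustrative assumptions, not tested against your setup):

    # chunked_driver.py -- buffer stdin into fixed-size batches of lines and
    # hand each batch to second_script.py as its own run, instead of waiting
    # for the whole stream to finish.
    import os
    import subprocess
    import sys
    import tempfile

    CHUNK_SIZE = 100000  # lines per batch (arbitrary)

    def process_chunk(lines):
        # write the batch to a temp file and run the mrjob script on it
        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as tmp:
            tmp.writelines(lines)
            path = tmp.name
        subprocess.check_call(['python', 'second_script.py', '-r', 'hadoop', path])
        os.remove(path)

    chunk = []
    for line in sys.stdin:
        chunk.append(line)
        if len(chunk) >= CHUNK_SIZE:
            process_chunk(chunk)
            chunk = []
    if chunk:  # flush whatever is left when the pipe hits EOF
        process_chunk(chunk)

Each batch becomes a separate Hadoop job, so you trade streaming for periodic processing; Hadoop itself still only ever sees complete files.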
John Vandenberg