I have an issue regarding mrjob. I'm running a Hadoop cluster with 3 datanodes, one namenode, and one jobtracker. Starting from a nifty sample application, I wrote something like the following
first_script.py:
for i in range(1, 2000000):
    print "My Line " + str(i)
This obviously writes a bunch of lines to stdout. The second script, second_script.py, contains the mrjob Mapper and Reducer.
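For context, it is a minimal sketch along these lines (the MRLineCount class name and the line-count logic are just placeholders, not my actual job):

from mrjob.job import MRJob

class MRLineCount(MRJob):
    # Mapper: emit one count per input line read from stdin
    def mapper(self, _, line):
        yield "lines", 1

    # Reducer: sum the counts per key
    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRLineCount.run()

Calling the two scripts from a Unix (GNU) shell, I tried: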
python first_script.py | python second_script.py -r hadoop
This gets the job done, but it first uploads the input to HDFS completely; only once everything is uploaded does it start the actual job. So my question is: is it possible to force a stream? (Like sending an EOF?) Or did I get the whole thing wrong?