I am new to using StarCluster/qsub/Grid Engine to run parallel jobs, and I have read a couple of other posts on the topic. I am still not sure how to build a scalable solution for my specific requirement, so I would like some more suggestions before I proceed.
Here are my requirements:
I have a huge tar file [~40-50 GB, and it can go up to 100 GB] -----> There is not much I can do here; I have to accept a single huge tar file as input.
I have to untar and decompress it -----> I run tar xvf tarfilename.tar | parallel pbzip2 -d to untar and decompress it (see the sketch after this list).
The output of this decompression is a few hundred thousand files, approximately 500,000.
These uncompressed files have to be processed. I have modular code that can take in a single file, process it, and output 5 different files.
Tar file -----Parallel decompression---> uncompressed files -----Parallel processing---> 5 output files per input file
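Concretely, the extraction step I am considering looks roughly like this (a sketch only: the /data paths and the .bz2 extension are placeholders for my actual layout):

    # 1. Extract the archive members (assumed to be individually bzip2-compressed)
    mkdir -p /data/uncompressed/
    tar xf tarfilename.tar -C /data/uncompressed/

    # 2. Decompress each extracted file; GNU parallel runs one pbzip2 per file
    #    so all cores stay busy even though each individual file is small
    find /data/uncompressed/ -name '*.bz2' | parallel pbzip2 -d {}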
I currently have a Parallel Python script that runs on a single machine with 16 cores and 16 GB of memory, taking in this list of uncompressed files and processing them in parallel.
The problem is how to scale seamlessly. For example, if my code has already been running for, say, 10 hours and I would like to add one more 8-core machine, I cannot do that with Parallel Python because the number of processors has to be known in advance.
At the same time, when I dynamically add more nodes to the current cluster, what happens to data accessibility and the read/write operations?
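For context, what I would like to be able to do mid-run is roughly the following (the cluster name is a placeholder, and I am assuming the new node automatically mounts the NFS-shared directories that StarCluster exports from the master, so the input and output files stay visible everywhere):

    # add another node to the running cluster while jobs are still queued
    starcluster addnode mycluster

    # optionally, let StarCluster grow/shrink the cluster based on queue load
    starcluster loadbalance mycluster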
So I went about reading and doing basic experiments with StarCluster and qsub. While I see that I can submit multiple jobs via qsub, how do I make it take its input files from the folder of uncompressed files?
For example, can I write a script.sh that loops over the file names and submits each one to qsub, as in the sketch below? Or is there a more efficient solution?
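Something like this is what I have in mind (a sketch only: process_file.py, the /data paths, and the log directory are placeholders, and I am assuming the input directory sits on the NFS share so every node can read it):

    #!/bin/bash
    # submit_all.sh - submit one Grid Engine job per uncompressed input file
    INPUT_DIR=/data/uncompressed     # on the NFS-shared filesystem
    OUTPUT_DIR=/data/output
    mkdir -p logs "$OUTPUT_DIR"

    for f in "$INPUT_DIR"/*; do
        # -b y runs the command line directly instead of treating it as a job script
        qsub -b y -cwd -o logs/ -e logs/ \
            python process_file.py "$f" "$OUTPUT_DIR"
    done

My worry is that 500,000 separate qsub submissions would put a lot of load on the scheduler, which is partly why I am also wondering about array jobs (see the next question).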
Say I have 3 machines with 16 CPUs each and I submit 48 jobs to the queue. Will qsub launch them automatically on the different CPUs across the cluster, or will I have to use parallel environment parameters such as qsub -pe orte <slots> to set the number of CPUs per job? Is it necessary to make my Python script an MPI program?
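From what I have read so far, a Grid Engine array job might be a cleaner fit than either of the above, since one submission covers all tasks and each task stays a plain serial Python process (no MPI). A sketch of what I think this would look like (filelist.txt, process_file.py and the paths are again placeholders):

    #!/bin/bash
    #$ -N process_files
    #$ -cwd
    #$ -t 1-500000                 # one array task per input file
    #$ -o logs/ -e logs/

    # filelist.txt holds one input file path per line, built once after decompression
    INPUT_FILE=$(sed -n "${SGE_TASK_ID}p" filelist.txt)

    python process_file.py "$INPUT_FILE" /data/output

My understanding is that the scheduler would then fill whatever slots are free across the nodes, including nodes added later, without any -pe setup, because every task is independent; please correct me if that is wrong.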
To summarize: I have a few hundred thousand files as input, and I would like to submit them as jobs to a queue served by multi-core machines. If I dynamically add more machines, the jobs should automatically be distributed to them.
Another major challenge is that I need the output of all 500,000-odd operations to be aggregated at the end. Are there suggestions on how to aggregate the output of parallel jobs as and when it is written out?
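One idea I am considering is a final job that is held back until all processing jobs have finished, using qsub -hold_jid, and then merges the per-file outputs. A sketch, assuming each processed input produces five outputs named <name>.type1 ... <name>.type5 (those names and the plain concatenation are placeholders for my real aggregation logic):

    #!/bin/bash
    # aggregate.sh - merge per-file outputs once all processing tasks are done
    OUTPUT_DIR=/data/output
    mkdir -p "$OUTPUT_DIR/merged"

    # find + xargs avoids the shell argument-length limit with ~500,000 files
    for t in type1 type2 type3 type4 type5; do
        find "$OUTPUT_DIR" -maxdepth 1 -name "*.$t" -print0 \
            | xargs -0 cat > "$OUTPUT_DIR/merged/all.$t"
    done

I would submit it with something like qsub -cwd -hold_jid process_files aggregate.sh so that it only starts after the array job above completes. Is that a sensible pattern, or is there a better way to aggregate incrementally while results are still being produced?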
I am test-running a few scenarios, but I would like to hear from people who have experimented with similar setups.
Any suggestions on using the Hadoop plugin? http://star.mit.edu/cluster/docs/0.93.3/plugins/hadoop.html
Thanks in advance,
Karthick