I have an Oozie workflow that runs a MapReduce job in a particular queue on the cluster.
I have to add more input sources/clients to this job, so it will be processing n times more data than it does today.
My question is: if, instead of having one big job process all the data, I break it down into multiple jobs, one per source, will that reduce the total time the jobs take to complete?
I know MapReduce breaks a job down into smaller tasks anyway and spreads them across the grid, so one big job should take about the same time as multiple small jobs.
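To make the comparison concrete, here is a rough sketch of the two submission patterns I'm weighing (class names, paths, and the queue name are made up for illustration, and the mapper/reducer setup is omitted; this uses the newer org.apache.hadoop.mapreduce API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same queue either way ("mapred.job.queue.name" on older MR1 clusters).
        conf.set("mapreduce.job.queuename", "myQueue");

        // Option A: one big job reading every source directory.
        Job big = Job.getInstance(conf, "all-sources");
        big.setJarByClass(SubmitJobs.class);
        for (String src : args) {                     // e.g. /data/source1 ... /data/sourceN
            FileInputFormat.addInputPath(big, new Path(src));
        }
        FileOutputFormat.setOutputPath(big, new Path("/out/all-sources"));
        big.waitForCompletion(true);

        // Option B: one job per source, each submitted separately.
        for (String src : args) {
            Job perSource = Job.getInstance(conf, "source-" + new Path(src).getName());
            perSource.setJarByClass(SubmitJobs.class);
            FileInputFormat.addInputPath(perSource, new Path(src));
            FileOutputFormat.setOutputPath(perSource, new Path("/out/" + new Path(src).getName()));
            perSource.submit();                       // non-blocking, so the jobs run concurrently
        }
    }
}
```

In my case Option B would be parallel actions forked in the Oozie workflow rather than a single driver, but the scheduling question is the same: all the jobs still land in the same queue under the same user.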
Also, capacity allocation within the queue is done on a per-user basis [1], so no matter how many jobs are submitted under one user, the capacity allocated to that user stays the same. Or is there something I am missing?
So will my jobs really run any faster if broken down into smaller jobs?
Thanks.
[1] https://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html#Resource+allocation