I have an Oozie workflow that runs a map-reduce job within a particular queue on the cluster.

I have to add more input sources/clients to this job, so it will be processing n times more data than it does today.

My question is: instead of having one big job processing all the data, if I break it down into multiple jobs, one per source, will I reduce the total time the jobs take to complete?
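Concretely, by "multiple jobs" I mean something like the workflow sketch below, with one map-reduce action per source forked to run in parallel. (This is only an illustration: the action names, ${sourceAInput}/${sourceBInput}/${sourceAOutput}/${sourceBOutput} and ${queueName} are placeholders, and the mapper/reducer configuration would be the same as in my existing job.)

    <workflow-app name="multi-source-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="fork-sources"/>

        <!-- one map-reduce action per input source, running in parallel -->
        <fork name="fork-sources">
            <path start="mr-source-a"/>
            <path start="mr-source-b"/>
        </fork>

        <action name="mr-source-a">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- all actions still run in the same queue -->
                    <property>
                        <name>mapred.job.queue.name</name>
                        <value>${queueName}</value>
                    </property>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>${sourceAInput}</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>${sourceAOutput}</value>
                    </property>
                    <!-- mapper/reducer classes omitted; same as the current single job -->
                </configuration>
            </map-reduce>
            <ok to="join-sources"/>
            <error to="fail"/>
        </action>

        <action name="mr-source-b">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.job.queue.name</name>
                        <value>${queueName}</value>
                    </property>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>${sourceBInput}</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>${sourceBOutput}</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="join-sources"/>
            <error to="fail"/>
        </action>

        <!-- both branches must complete before the workflow ends -->
        <join name="join-sources" to="end"/>

        <kill name="fail">
            <message>Map-reduce action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>
        <end name="end"/>
    </workflow-app>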

I know MapReduce already breaks a job down into smaller tasks and spreads them across the grid, so one big job should be the same as multiple small jobs.

Also, the capacity allocation within the queue is done on a 'per user' basis [1], so no matter how many jobs are submitted under one user, the capacity allocated to that user will be the same. Or is there something I am missing?
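For reference, my understanding of the linked 'Resource allocation' section is that the per-user behaviour comes from settings like these in capacity-scheduler.xml (this assumes the classic MR1 CapacityScheduler that the r1.2.1 doc describes; the queue name 'myqueue' and the values are made up):

    <!-- share of the cluster capacity given to this queue -->
    <property>
        <name>mapred.capacity-scheduler.queue.myqueue.capacity</name>
        <value>30</value>
    </property>

    <!-- when several users compete for the queue, a single user's share
         is not limited below this percentage of the queue's capacity;
         free capacity is divided among the users that have running jobs -->
    <property>
        <name>mapred.capacity-scheduler.queue.myqueue.minimum-user-limit-percent</name>
        <value>25</value>
    </property>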

So will my jobs really run any faster if broken down into smaller jobs?

Thanks.

[1] https://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html#Resource+allocation

  • I am assuming 'smaller jobs' means more map-reduce actions in the workflow. If your current map-reduce job is able to scale horizontally with the newly added data, then you need not do anything. Otherwise you can process the additional data using additional map-reduce actions. You should, I guess, define these actions under a fork-join to execute them in parallel. Thanks. – YoungHobbit Mar 24 '17 at 05:44
  • I mean more Oozie workflows or more workflow actions. Example: one job processing 100 records vs. 10 jobs processing 10 records each in parallel. As you suggested, I also think both should be the same. But I wanted to make sure, especially in terms of resource competition within a queue. – Gadam Mar 24 '17 at 19:12

0 Answers