I have the following scenario:
Multiple MapReduce jobs using Apache Crunch, scheduled with Oozie. Let's consider only one job for simplicity. What I want to achieve is reducing the number of mappers for that job. The number of mappers is equal to the number of input splits. The job has multiple DoFns that process the data.
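For context, my understanding is that with the standard new-API FileInputFormat the split size is computed as max(minSize, min(maxSize, blockSize)); the snippet below is just a toy illustration of that formula with made-up sizes, not part of my job (and the old mapred API computes it slightly differently):

```java
// Illustration of how the new-API FileInputFormat derives the split size,
// and therefore the number of mappers, for a splittable file.
public class SplitSizeSketch {

  // Same formula as FileInputFormat#computeSplitSize in the mapreduce API.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block (illustrative)
    long minSize   = 2_000_000_000L;     // mapreduce.input.fileinputformat.split.minsize
    long maxSize   = Long.MAX_VALUE;     // split.maxsize left at its default

    // Prints 2000000000: a larger min split size should mean fewer, larger splits.
    System.out.println(computeSplitSize(blockSize, minSize, maxSize));
  }
}
```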
To do that I tried the following, all without success (a consolidated sketch of these attempts follows the list):
- setting mapred.min.split.size / mapreduce.input.fileinputformat.split.minsize in the global job configuration.
- adding the same parameters (the min split size) to the source, TableSource<ShortWritable, BytesWritable> source = new SeqFileTableSource<>(...), via source.inputConf(...), and then reading it with pipeline.read(source).parallelDo()...
- passing **ParallelDoOptions.builder().conf("mapred.min.split.size", "2000000000").build()** (and likewise with mapreduce.input.fileinputformat.split.minsize) to each parallelDo.
- lastly, setting a Configuration containing this key/value pair manually on each DoFn (the jobs use custom DoFns that extend the DoFn abstract class) via myCustomDoFn.setConfiguration(...).

I applied these settings one by one and also all of them at once, but nothing works: the number of mappers stays the same.
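To make the attempts concrete, here is a stripped-down sketch of what the combination looks like in my code (the paths, the stage name, and the inline DoFn are placeholders, not my real job):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pair;
import org.apache.crunch.ParallelDoOptions;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.seq.SeqFileTableSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.ShortWritable;

public class SplitSizeAttempt {
  public static void main(String[] args) {
    // Attempt 1: set the min split size on the global job configuration.
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.fileinputformat.split.minsize", "2000000000");
    conf.set("mapred.min.split.size", "2000000000"); // old-style key

    Pipeline pipeline = new MRPipeline(SplitSizeAttempt.class, conf);

    // Attempt 2: set the same key on the source's own input configuration.
    SeqFileTableSource<ShortWritable, BytesWritable> source =
        new SeqFileTableSource<>(
            new Path("/data/input"), // placeholder path
            Writables.tableOf(
                Writables.writables(ShortWritable.class),
                Writables.writables(BytesWritable.class)));
    source.inputConf("mapreduce.input.fileinputformat.split.minsize", "2000000000");

    // Attempt 3: pass the key through ParallelDoOptions on each parallelDo.
    ParallelDoOptions opts = ParallelDoOptions.builder()
        .conf("mapreduce.input.fileinputformat.split.minsize", "2000000000")
        .build();

    PCollection<String> processed = pipeline.read(source).parallelDo(
        "my-dofn", // placeholder stage name
        new DoFn<Pair<ShortWritable, BytesWritable>, String>() {
          @Override
          public void process(Pair<ShortWritable, BytesWritable> input,
                              Emitter<String> emitter) {
            // Placeholder processing; the real job chains several custom DoFns.
            emitter.emit(input.first() + ":" + input.second().getLength());
          }
        },
        Writables.strings(),
        opts);

    pipeline.writeTextFile(processed, "/data/output"); // placeholder output path
    pipeline.done();
  }
}
```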
What am I doing wrong, and how can I change the number of mappers created by a job in Apache Crunch?