MapReduce or a batch job?

Question

I have a function that needs to be called on a lot of files (thousands). Each file is independent of the others, so they can all be processed in parallel. The outputs for the individual files do not (currently) need to be combined. I have a lot of servers I can scale this across, but I'm not sure which approach to take:

1) Run a MapReduce on it

2) Create 1000's of jobs (each has a different file it works on).

Would one solution be preferable to another?

Thanks!

score 6 · Answer 1 · edited Jun 20 '20 at 09:12

MapReduce provides significant value when distributing work over large datasets. Your case consists of small, independent jobs on small, independent data files, so in my opinion it could be overkill.

So I would prefer to run a bunch of dynamically created batch files.
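As a rough illustration of what "dynamically created" could mean here, the sketch below writes one small job script per input file. The function name make_jobs, the directory layout, and the .out suffix are assumptions made for the example, and your_processing is only a placeholder for the real command.

```shell
#!/bin/sh
# Sketch: generate one job script per input file.
# make_jobs, the directory names and the .out suffix are illustrative assumptions;
# your_processing stands in for the real command.
# Assumes file names contain no single quotes.

make_jobs() {
    in_dir=$1
    job_dir=$2
    mkdir -p "$job_dir"
    for f in "$in_dir"/*; do
        [ -f "$f" ] || continue              # skip subdirectories and an unmatched glob
        name=$(basename "$f")
        {
            printf '%s\n' '#!/bin/sh'
            # Single quotes in the generated line protect spaces in the paths
            printf "your_processing '%s' > '%s'\n" "$f" "$job_dir/$name.out"
        } > "$job_dir/$name.sh"
        chmod +x "$job_dir/$name.sh"
    done
}
```

Calling `make_jobs ./files ./jobs` would leave one executable script per file in ./jobs, and each script can then be handed to whichever server is free, provided that server can see the same paths.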

Or, alternatively, use a cluster manager and job scheduler, like SLURM https://computing.llnl.gov/linux/slurm/

SLURM: A Highly Scalable Resource Manager

SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

+1. In general, you want to do the simplest thing that will work well for your problem. Setting up MapReduce to do something that can be done with independent scripts on independent files is, as @PA points out, overkill. Having said that, if you know you want to learn MapReduce for some other reason and you want to use this simpler problem as a starting point, go for it. But I wouldn't otherwise recommend it for this case. — Jonathan Dursi, Jul 12 '11 at 11:10

score 2 · Answer 2 · answered Jul 21 '11 at 10:57

Since it is only thousands of files (and not billions), a full-blown Hadoop setup is probably overkill. GNU Parallel tries to fill the gap between sequential scripts and Hadoop:

ls files | parallel -S server1,server2 your_processing {} '>' out{}

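You will probably want to learn about --sshloginfile, which reads the list of servers from a file instead of from -S. Depending on where the files are stored, you may want to learn --trc, too: it transfers each input file to the remote server, returns the named result file, and cleans up afterwards.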

Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ

score 0 · Answer 3 · answered Nov 28 '22 at 13:08

Use a job array in Slurm. There is no need to submit thousands of jobs, just one: the array job.

This will kick off the same program on as many nodes / cores as are available with the resources you specify. Eventually it will churn through them all. The only remaining issue is how to map the array index to a file to process. The simplest way is to prepare a text file listing all the paths, one per line. Each element of the job array then reads the i-th line of this file, where i is its array index, and uses that as the path of the file to process.
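A minimal sketch of such an array script, assuming the list is called paths.txt and has 1000 lines. The job name, the helper nth_path, and the output file naming are made up for the example, and your_processing is a placeholder for the real program. SLURM_ARRAY_TASK_ID is the variable Slurm sets for each array element.

```shell
#!/bin/bash
#SBATCH --job-name=process-files
#SBATCH --array=1-1000          # assumption: paths.txt has 1000 lines

# Print line N (1-based) of a list file, then stop reading.
nth_path() {
    sed -n "${1}{p;q;}" "$2"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array element.
# Outside Slurm the variable is unset and this block is skipped.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    file=$(nth_path "$SLURM_ARRAY_TASK_ID" paths.txt)
    # your_processing is a placeholder for the real program
    your_processing "$file" > "out_${SLURM_ARRAY_TASK_ID}.txt"
fi
```

The list could be built with something like `find /path/to/files -type f > paths.txt`, and the script submitted once with `sbatch`. If the cluster should not run all elements at once, a suffix such as `--array=1-1000%50` limits how many run at the same time.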
