
I'm going to set up a Linux server (probably CentOS) in a computer science department. The server will be used as a compute server, by people doing research on GPU computing, bioinformatics, or AI.

Hypothetically I could just give a shell to each user and let them launch their jobs, and probably that's just what I'll do at the beginning.

However, I'm faced with a potential problem: sometimes the machine will be used as a computing facility with the aim of just getting the computation results, while sometimes it will be used as a benchmarking platform, in order to measure the efficiency of new techniques/algorithms/whatever.

This means that, while the server is being used for a task of the second kind, other users should not be able to launch heavy tasks that would interfere with the benchmarking results.

So I'd like to set up, and possibly automate, a system along these lines:

  1. Typically, users have no resource limits, and different jobs are scheduled and share the system's resources normally.
  2. If a user launches a "priority" job, other users are put into a restricted cgroup, limited to only one or two of the available CPUs and to a restricted amount of memory.
  3. The priority job is launched in a separate cgroup that has access to all the other CPUs and has no limit on memory usage (see the sketch after this list for roughly what I have in mind).
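
For concreteness, here is a rough sketch of steps 2 and 3 against the raw cgroup-v1 interface (the CPU ranges, limits and PIDs are made-up examples for a 32-core box; this is the behaviour I want to automate, not a solution):

    # restricted group for everyone else: CPUs 0-1 and 4 GiB of memory
    mkdir -p /sys/fs/cgroup/cpuset/restricted /sys/fs/cgroup/memory/restricted
    echo 0-1 > /sys/fs/cgroup/cpuset/restricted/cpuset.cpus
    echo 0   > /sys/fs/cgroup/cpuset/restricted/cpuset.mems  # must be set before adding tasks
    echo 4G  > /sys/fs/cgroup/memory/restricted/memory.limit_in_bytes
    echo "$SOME_OTHER_PID" > /sys/fs/cgroup/cpuset/restricted/tasks
    echo "$SOME_OTHER_PID" > /sys/fs/cgroup/memory/restricted/tasks

    # priority group: all the remaining CPUs, no memory limit
    mkdir -p /sys/fs/cgroup/cpuset/priority
    echo 2-31 > /sys/fs/cgroup/cpuset/priority/cpuset.cpus
    echo 0    > /sys/fs/cgroup/cpuset/priority/cpuset.mems
    echo "$PRIORITY_JOB_PID" > /sys/fs/cgroup/cpuset/priority/tasks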

Is there some software package that helps automate such an architecture? Everything I find on the internet talks about orchestrating containers, but the difference here is that I want to restrict the resources used by others while my job is running, so launching the job in a container does not help.

I've also looked at something like dockersh to implement the reverse: everybody logs in directly inside a container, so I can easily allocate resources to each user on demand. But dockersh seems unmaintained, and I didn't find anything else that implements the same concept.

gigabytes

1 Answer


Linux with systemd can apply cgroup resource controls to units. You could give machine.slice a large share of CPU time, such as CPUShares=100000, and then run the important jobs as containers in that slice.
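
For instance (the value is illustrative; on pure cgroup-v2 hosts, CPUWeight= has superseded the older CPUShares= property):

    # give machine.slice, where container managers place their payloads by
    # default, a far larger CPU share than the user slices get
    systemctl set-property machine.slice CPUShares=100000

    # confirm the property took effect
    systemctl show machine.slice -p CPUShares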

A quota that is only imposed while a priority job runs is trickier: these quotas are static and have to be adjusted at the right moments. That can be scripted, for example (thanks to Chris's wiki):

    systemctl --runtime set-property user-1001.slice CPUQuota=200%
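
That command could be wrapped in a small script around the priority job, throttling everyone on entry and lifting the limits on exit. A sketch, where the slice name and quota are assumptions to adapt (per systemd.resource-control, assigning an empty value unsets a runtime property again):

    #!/bin/bash
    # run-priority: throttle every user slice, run the job, lift the limits.
    set -eu

    uids=$(loginctl list-users --no-legend | awk '{print $1}')
    for uid in $uids; do
        systemctl --runtime set-property "user-$uid.slice" CPUQuota=200%
    done

    # run the job in its own transient scope, outside the throttled slices
    systemd-run --scope --slice=priority.slice "$@" || true

    # empty assignment removes the runtime quota again
    for uid in $uids; do
        systemctl --runtime set-property "user-$uid.slice" CPUQuota=
    done

Invoked as root, e.g. ./run-priority ./my_benchmark --some-args.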

If you need additional features such as batch job control, you will need to find or write a system that provides them.
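
As a minimal stand-in, priority jobs can at least be serialized with an exclusive lock so that two benchmarks never overlap (flock comes with util-linux; run-priority is the hypothetical wrapper sketched above):

    # later invocations block until the current benchmark releases the lock
    flock /var/lock/benchmark.lock ./run-priority ./my_job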

John Mahowald
  • Is it essential to run "important jobs as containers"? Can I just run the process in the slice? – gigabytes Jul 04 '19 at 13:14
  • No. A container uses the default machine.slice, which is already distinguishable from interactive users. Instead, set resource controls on any unit at any level: a custom slice, a service, etc. – John Mahowald Jul 04 '19 at 13:40