Let's say I'm building something like AWS Lambda / Cloudflare Workers, where I allow users to submit arbitrary binaries, and then I run them wrapped in sandboxes (e.g. Docker containers, gVisor, etc.), packed multi-tenant onto a fleet of machines.
Ignore the problem of ensuring the sandboxing is effective for now; assume that problem is solved.
Each individual execution of one of these worker-processes is potentially a very heavy workload (think SQL OLAP reports). A worker-process may spend tons of CPU, memory, IOPS, etc. We want to allow them to do this. We don't want to limit users to a small fixed slice of a machine, which is what traditional cgroup limits provide. Part of our service's value proposition is low latency (rather than high throughput) in answering heavy queries, and that means allowing each query to essentially monopolize our infrastructure as much as it needs, with as much parallelization as it can manage, to get done as quickly as possible.
We want to charge users in credits for the resources they use, according to some formula that combines CPU-seconds, memory-GB-seconds, IO operations, etc. This will disincentivize users from submitting "sloppy" worker-processes (because a process that costs us more to run costs them more to submit). It will also prevent users from DoSing us with ultra-heavy workloads without first buying enough credits to pay the ensuing autoscaling bills in advance :)
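To make the formula concrete, here is a minimal sketch of the kind of thing I have in mind. The `Rates` values are made-up numbers, and `credits_spent` / `amount_to_bill` are hypothetical names, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices, in credits per unit of resource."""
    per_cpu_second: float = 1.0
    per_mem_gb_second: float = 0.25
    per_io_op: float = 0.001


def credits_spent(cpu_seconds, mem_gb_seconds, io_ops, rates=Rates()):
    """Linear combination of the three metered resources."""
    return (cpu_seconds * rates.per_cpu_second
            + mem_gb_seconds * rates.per_mem_gb_second
            + io_ops * rates.per_io_op)


def amount_to_bill(spent, limit):
    """A run never costs more than the limit the user set at launch."""
    return min(spent, limit)
```

For example, 10 CPU-seconds, 40 GB-seconds of memory and 2000 IO operations would come to roughly 22 credits under these made-up rates.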
We would also like to let users set, for each worker-process launch, a limit on the total credit spend during execution. If the process spends too many CPU-seconds, or holds too much memory for too long, or does too many IO operations, or any combination of these that adds up to "spending too many credits", then the worker-process gets hard-killed by the host machine. (We then bill their account for exactly the credit limit they specified at launch, even though the job did not complete.) This would protect users (and us) from the monetary consequences of launching faulty/leaky workers; it would also let us predict an upper bound on how heavy a workload could be before running it, and autoscale accordingly.
This second requirement implies that we can't do the credit-spend accounting after the fact, asynchronously, using observed per-cgroup metrics fed into some time-series server. Instead, we need the host (hypervisor) running each worker to do the credit-spend accounting as the worker runs, in order to stop it as close as possible to the moment it overruns its budget.
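The naive version of this I can imagine is a host-side loop that polls the worker's cgroup v2 files and kills the cgroup once the running total crosses the limit. This is only a sketch under assumptions: it assumes cgroup v2 with the `cpu`, `memory` and `io` controllers enabled and a kernel new enough to have `cgroup.kill` (5.14+); `BudgetMeter`, `enforce` and the rates are hypothetical; and I don't know whether polling like this is cheap or precise enough at a few-ms interval.

```python
import time
from pathlib import Path


def parse_cpu_seconds(cpu_stat_text):
    """Total CPU time from the `usage_usec` field of cgroup v2 `cpu.stat`."""
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value) / 1_000_000
    raise ValueError("usage_usec not found")


def parse_io_ops(io_stat_text):
    """Sum of `rios` + `wios` over every device line in cgroup v2 `io.stat`."""
    total = 0
    for line in io_stat_text.splitlines():
        for field in line.split()[1:]:  # first token is the MAJ:MIN device id
            key, _, value = field.partition("=")
            if key in ("rios", "wios"):
                total += int(value)
    return total


class BudgetMeter:
    """Running credit total; rates are made-up numbers (hypothetical)."""

    def __init__(self, limit, cpu_rate=1.0, mem_rate=0.25, io_rate=0.001):
        self.limit = limit
        self.cpu_rate, self.mem_rate, self.io_rate = cpu_rate, mem_rate, io_rate
        self.mem_gb_seconds = 0.0
        self.spent = 0.0

    def update(self, cpu_seconds, mem_bytes, io_ops, dt):
        """Fold in one sample; return True once the budget is exhausted."""
        # CPU and IO counters are cumulative; memory must be integrated over time.
        self.mem_gb_seconds += (mem_bytes / 2**30) * dt
        self.spent = (cpu_seconds * self.cpu_rate
                      + self.mem_gb_seconds * self.mem_rate
                      + io_ops * self.io_rate)
        return self.spent >= self.limit


def enforce(cgroup_dir, meter, interval=0.005):
    """Poll the cgroup until it empties or overruns; kill it on overrun."""
    cg = Path(cgroup_dir)
    last = time.monotonic()
    while True:
        time.sleep(interval)
        now = time.monotonic()
        over = meter.update(
            parse_cpu_seconds((cg / "cpu.stat").read_text()),
            int((cg / "memory.current").read_text()),
            parse_io_ops((cg / "io.stat").read_text()),
            now - last,
        )
        last = now
        if over:
            (cg / "cgroup.kill").write_text("1")  # SIGKILLs every process in the cgroup
            return meter.spent
        if "populated 0" in (cg / "cgroup.events").read_text():
            return meter.spent  # worker exited on its own
```

The obvious weakness is that the worker can overshoot by up to one polling interval times its parallelism, which is exactly the gap I'd like to close.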
Basically, this is, to a T, a description of the "gas" accounting system in the Ethereum Virtual Machine: the EVM does credit-spend accounting based on a formula that combines resource costs for each op, and hard-kills any "worker process" (smart contract execution) that goes over the credit (gas) limit allocated for this launch (tx and/or CALL op) of the worker.
However, the "credit-spend accounting" in the EVM is enabled by instrumenting the VM that executes the code, such that each VM ISA op also updates a gas-left-to-spend VM register, and VM execution aborts if the gas-left-to-spend ever goes negative. When running native code on bare-metal or regular IaaS VMs, we don't have the ability to instrument our CPU like that. (And doing so through static binary translation would probably introduce far too much overhead.) So doing this the way the EVM does it is not really an option.
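For anyone unfamiliar with that model, here is a toy sketch of per-op gas metering in an interpreter. The opcodes and costs are hypothetical and are not the real EVM gas schedule; the point is only that the charge-and-check happens inside the dispatch loop, which is the part native code doesn't give us.

```python
class OutOfGas(Exception):
    """Raised when an op would push the remaining gas below zero."""


# Hypothetical opcodes and per-op costs; NOT the real EVM schedule.
GAS_COST = {"PUSH": 1, "ADD": 2, "MUL": 3, "STORE": 20}


def run(program, gas_limit):
    """Execute a list of (op, *args) tuples, charging gas before each op."""
    gas_left = gas_limit
    stack, storage = [], {}
    for op, *args in program:
        gas_left -= GAS_COST[op]
        if gas_left < 0:
            raise OutOfGas(f"ran out of gas at {op}")
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "STORE":
            storage[args[0]] = stack.pop()
    return storage, gas_left


# (2 + 3) * 4 stored under "x"; total cost 1 + 1 + 2 + 1 + 3 + 20 = 28 gas.
PROGRAM = [("PUSH", 2), ("PUSH", 3), ("ADD",),
           ("PUSH", 4), ("MUL",), ("STORE", "x")]
```

With a gas limit of 30, `run(PROGRAM, 30)` finishes with 2 gas left; with a limit of 27 it aborts at the `STORE`.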
I know Linux does CPU accounting, memory accounting, etc. Is there a way, using some combination of cgroups + gVisor-like syscall proxying, to approximate the function of the EVM's "tx gas limit", i.e. to have processes hard-killed instantly, or within a few ms, when they go over their credit limit?
I'm assuming there's no off-the-shelf solution for this (I haven't been able to find one after much research). But are the right CPU counters + kernel data structures + syscalls in place to be able to develop such a solution, and to have it be efficient/low-overhead?