I have a service that, when invoked, performs expensive operations on a large dataset.
The dataset is a list of items, i.e. something like a `List<Item>`, which contains on average a few million `Item` instances.
All `Item` instances in the list are different from each other, and the service executes the same method, `Process(Item item)`, on each of them. The `Process(Item item)` method is mostly CPU-bound; however, it requires exclusive access to a file on the file system to process the given `Item` correctly. This means the items in the list cannot simply be processed in parallel.
Due to the large amount of data that needs to be processed, I am looking for a way to improve performance by processing the items in parallel.
A simple (but not elegant) way to do that would be to make a few copies of the file and run an equal number of threads: this would allow me to process as many `Item` instances in parallel as there are file copies.
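For concreteness, this file-copy approach could be sketched roughly as follows (a minimal Java sketch of the idea; `Item`, `process`, the file paths, and the copy count are placeholders, not my real code). Each worker "borrows" a private copy of the file from a pool, so no two threads ever use the same copy at the same time:

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class FileCopyProcessor {

    // Hypothetical stand-in for the real Item type.
    record Item(int id) {}

    // Placeholder for the real CPU-bound Process(Item item); it needs
    // exclusive access to the file behind filePath while it runs.
    static void process(Item item, Path filePath) {
        // ... expensive, CPU-bound work using filePath exclusively ...
    }

    // Processes all items using `copies` private copies of the original
    // file, one per concurrent worker. Returns the number of items done.
    static int processAll(List<Item> items, Path original, int copies)
            throws Exception {
        // Pool of file copies: a worker takes a copy, processes one item
        // with it, then returns it. Exclusive access is guaranteed because
        // each copy is held by at most one thread at a time.
        BlockingQueue<Path> filePool = new LinkedBlockingQueue<>();
        for (int i = 0; i < copies; i++) {
            Path copy = original.resolveSibling("copy-" + i + ".bin");
            Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
            filePool.put(copy);
        }

        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(copies);
        for (Item item : items) {
            pool.submit(() -> {
                Path copy = filePool.take();   // borrow a file copy
                try {
                    process(item, copy);
                    done.incrementAndGet();
                } finally {
                    filePool.put(copy);        // give it back
                }
                return null;                   // lambda is a Callable
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        return done.get();
    }

    public static void main(String[] args) throws Exception {
        Path original = Files.createTempFile("dataset", ".bin");
        List<Item> items = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) items.add(new Item(i));
        System.out.println("processed: " + processAll(items, original, 4));
    }
}
```

This works, but the copy management (creating, naming, and cleaning up the copies) is exactly the part I'd rather not own.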
However, I wish to find a cleaner and more elegant approach, as I don't want to manage those file copies manually.
To do that, I am looking into using Docker containers and Kubernetes.
In such a setup, the Docker image would include both the service runtime as well as the file, so that each container (or Pod) that is created from that image would have its own copy of the file.
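To clarify the setup I have in mind, a Deployment along these lines would give each Pod its own copy of the file simply because the file is baked into the image (the image name and file path below are made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: item-processor
spec:
  replicas: 4            # degree of parallelism = number of Pods
  selector:
    matchLabels:
      app: item-processor
  template:
    metadata:
      labels:
        app: item-processor
    spec:
      containers:
      - name: processor
        # Image contains both the service runtime and the file,
        # e.g. at /data/dataset.bin, so every Pod has a private copy.
        image: registry.example.com/item-processor:latest
```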
The question:
At this point, what I am mostly missing is how to orchestrate the processing of the `Item` instances across the various containers in a robust way. How can I do that?
Note that a similar question was raised in this StackOverflow question, and most answers suggested relying on Kubernetes liveness and readiness probes to prevent traffic from being routed to a given Pod (in my case, a Pod that is already processing an `Item` instance). However, I don't think probes were designed to be used this way, and the approach feels more like a hack to me, so I am looking for a more solid solution that gives better control over how the `Item` instances are processed.