
I have a service that, when invoked, performs expensive operations on a large dataset.

The dataset is a list of items, i.e. something like a `List<Item>` containing, on average, a few million `Item` instances.
All `Item` instances in the list are different from each other, and the service executes the same method, `Process(Item item)`, on each of them. The `Process(Item item)` method is mostly CPU-bound; however, it requires exclusive access to a file on the file system to process the given `Item` correctly. This means the items in the list cannot be processed in parallel.
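For illustration, the constraint looks roughly like this (a minimal sketch; the real legacy code is not shown here, and the `Item` shape and file path below are stand-ins):

```csharp
using System.IO;

// Stand-ins for the real types (the actual legacy code differs).
public record Item(int Id);

public class LegacyService
{
    private const string DataFilePath = "data.bin"; // hypothetical path

    // Process is mostly CPU-bound, but it opens the file exclusively,
    // so two concurrent calls would fail with a sharing violation.
    public void Process(Item item)
    {
        using var stream = new FileStream(
            DataFilePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
        // ... CPU-bound work that also reads/writes the exclusive file ...
    }
}
```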

Due to the large amount of data that needs to be processed, I am looking into a way to improve the performance by processing the items in parallel.

A simple (but not elegant) way to do that would be to make a few copies of the file and run an equal number of threads: this would allow me to process as many `Item` instances in parallel as there are file copies.
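A minimal sketch of that idea, assuming the legacy `Process` could somehow be bound to a specific file copy (which its real signature may not allow):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelProcessor
{
    // Reuses the Item stand-in from the sketch above.
    // ProcessWithFile is hypothetical: it assumes the legacy Process(Item)
    // can be pointed at a specific file copy.
    public static void ProcessAll(IReadOnlyList<Item> items, string originalFile, int degree)
    {
        // One private copy of the file per worker thread.
        var copies = Enumerable.Range(0, degree)
            .Select(i =>
            {
                var copy = $"{originalFile}.{i}";
                File.Copy(originalFile, copy, overwrite: true);
                return copy;
            })
            .ToList();

        // All items go through one thread-safe queue...
        var queue = new ConcurrentQueue<Item>(items);

        // ...and each worker drains it using its own file copy.
        var workers = copies
            .Select(copy => Task.Run(() =>
            {
                while (queue.TryDequeue(out var item))
                    ProcessWithFile(item, copy);
            }))
            .ToArray();

        Task.WaitAll(workers);
    }

    private static void ProcessWithFile(Item item, string filePath)
    {
        // Placeholder: invoke the legacy Process(Item) against filePath.
    }
}
```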

However, I would like a cleaner, more elegant approach, as I don't want to manage those file copies manually.

To do that, I am looking into using Docker containers and Kubernetes.
In such a setup, the Docker image would include both the service runtime and the file, so that each container (or Pod) created from that image would have its own copy of the file.

The question:
At this point, what I am mostly missing is how to orchestrate the processing of the `Item` instances across the various containers in a robust way. How can I do that?

Note that a similar question was raised in this StackOverflow question, and most answers suggested relying on Kubernetes liveness and readiness probes to prevent traffic from being routed to a given Pod, in my case a Pod that is already processing an `Item` instance.
However, I don't think probes were designed to be used this way, and the approach feels more like a hack to me, so I am looking for a more solid solution that gives better control over how the `Item` instances are processed.
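To make the kind of control I am after concrete, here is roughly the shape I have in mind: a shared work queue that each Pod consumes from, with exclusive file access staying local to each Pod. The `IWorkQueue` abstraction below is hypothetical, standing in for whatever broker would be chosen (RabbitMQ, Redis, a database table acting as a queue, ...):

```csharp
#nullable enable
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction over whatever broker is chosen.
public interface IWorkQueue
{
    Task<Item?> TryDequeueAsync(CancellationToken ct);
    Task AcknowledgeAsync(Item item, CancellationToken ct);
}

// Each Pod runs one consumer loop against its private file copy.
public class Worker
{
    private readonly IWorkQueue _queue;
    private readonly LegacyService _service; // wraps Process(Item), as sketched above

    public Worker(IWorkQueue queue, LegacyService service)
        => (_queue, _service) = (queue, service);

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var item = await _queue.TryDequeueAsync(ct);
            if (item is null) break; // queue drained, let the Pod exit

            _service.Process(item);                  // exclusive file access stays local to this Pod
            await _queue.AcknowledgeAsync(item, ct); // remove only after success (at-least-once)
        }
    }
}
```

With this shape, Kubernetes only scales the consumers (e.g. as a Deployment or Job), and it is the queue, not the probes, that decides which Pod processes which `Item`.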

smn.tino
  • What exactly is in the file? Can you load its content in memory, and use it as a readonly data source? Can you split the content of the file into smaller components, so that you would need to traverse only a smaller part of the file? The option of running everything in docker doesn't really seem to be much different than the option of copying the file and running multiple instances, the only difference being your application now is more complex and has more dependencies. That being said, you can take a look at this nuget package: https://github.com/mariotoffia/FluentDocker – npinti Jan 05 '23 at 19:52
  • The file as well as the `Process(Item item)` implementation is part of a legacy implementation which cannot be changed, so the solution should deal with the fact that those components unfortunately cannot be changed. – smn.tino Jan 05 '23 at 20:34
  • Solving a file access issue with k8s has got to be one of the most absurd things I've come across. But then again, I work almost exclusively with multiple legacy systems, and sympathise with the amount of acrobatics needed to get some things to work. So good luck! – NPras Jan 06 '23 at 00:18
  • @NPras thanks, that is actually a good observation - in principle, I agree data and application should not sit together within a container image; the preferable solution would certainly be to update the code so that there's no restriction on consuming a file from the file system and e.g. just load it in memory. However, this is one of those cases where the optimal solution cannot be achieved due to time and cost restrictions, so I am looking for a solid alternative. – smn.tino Jan 06 '23 at 07:26
  • Is decompiling `Process(Item)` and reimplementing it an option? I've had to do it once. Ended up with an unholy amount of reflection to access `private` fields & methods; but if it works, it's worth it. – NPras Jan 06 '23 at 08:21
  • I think you're asking two separate questions here. You seem to be asking, abstractly, how can you make your sequential pipeline use multiple processes? Kubernetes could be a good answer, given multiple processes, to have them use multiple nodes (maybe dynamically allocating cloud resources). But on its own, Kubernetes doesn't directly address problems like batch processing on a large data queue. – David Maze Jan 06 '23 at 12:23
  • Did the above comments help you to resolve the issue? If yes, can you post the steps you followed as an answer, for greater visibility to the community? – Fariya Rahmat Jan 09 '23 at 10:05
  • @NPras decompiling is not really an option as this module is maintained by another Team and there's no plan to optimize it in such a way that data is consumed more cleverly, so I need to plug it in as-is – smn.tino Jan 10 '23 at 11:05
  • Is the file static, or does it change over time? Is it possible to pass the `ProcessItem` method the name of the file to work on? Is it possible to split the file? If so, I would create an application layer that takes care of dividing the large file into smaller files to be processed by the method, and after processing I would mark the files as processed, regardless of k8s involvement. – kiggyttass Jan 14 '23 at 20:19

0 Answers