I'm trying to capitalize on Ray's parallelization model to process a file record by record. The code works beautifully, but the object storage grows fast and ends up crashing. I'm avoiding using ray.get(function.remote()) because it kills the performance because the task is composed of a few million subtasks and the overhead of waiting for a task to finish. Is there a way to set a global limit to the object storage?
#code which constantly backpressusre the obejct storage, freeing space, but causes performance to be worse than serial execution
for record in infile:
ray.get(createNucleotideCount.remote(record, copy.copy(dinucDict), copy.copy(tetranucDict),dinucList,tetranucList, filename))
#code that maximizes throughput but makes the object storage grow constantly
for record in infile:
createNucleotideCount.remote(record, copy.copy(dinucDict), copy.copy(tetranucDict),dinucList,tetranucList, filename)
#the called function returns either 0 or 1.