I'm looking to execute a batch of processes in parallel, but process each batch in series using RXPY (we're using v3 right now). Each process is kicked off, then I use RXPY to wait for a set amount of time before ending the process. Here's a basic version:

import rx
from rx import operators as ops

def start_task(value):
    print(f"Started {value}")
    return value

def end_task(value):
    print(f"End: {value}")

def main():
    print("Start main")

    rx.interval(1).pipe(
        ops.flat_map(lambda time: rx.from_([1, 2]).pipe(
            ops.map(lambda value: [time, value])
        )),
        ops.map(lambda value: start_task(value)),
        ops.delay(2),
        ops.map(lambda value: end_task(value)),
    ).run()

main()

The problem with this is that the long-running processes overlap each other. In other words, I do not want new processes to start before the last batch has finished. In the above example, the output is:

Start main
Started [0, 1]
Started [0, 2]
Started [1, 1]
Started [1, 2]
Started [2, 1]
Started [2, 2]
End: [0, 1]
End: [0, 2]
End: [1, 1]
Started [3, 1]
End: [1, 2]
Started [3, 2]
End: [2, 1]
End: [2, 2]
...

As you can see, the batches for times 1 and 2 started before the batch for time 0 ended.

I can solve this by adding a boolean variable working, somewhat like a semaphore:

import rx
from rx import operators as ops

def start_task(value):
    print(f"Started {value}")
    return value

def end_task(value):
    print(f"End: {value}")

def main():
    print("Start main")

    global working
    working = False

    def set_working(value):
        global working
        working = value

    rx.interval(1).pipe(
        ops.filter(lambda time: not working),
        ops.do_action(lambda value: set_working(True)),
        ops.flat_map(lambda time: rx.from_([1, 2]).pipe(
            ops.map(lambda value: [time, value])
        )),
        ops.map(lambda value: start_task(value)),
        ops.delay(2),
        ops.map(lambda value: end_task(value)),
        ops.do_action(lambda value: set_working(False)),
    ).run()

main()

With the following output:

Start main
Started [0, 1]
Started [0, 2]
End: [0, 1]
End: [0, 2]
Started [3, 1]
Started [3, 2]
End: [3, 1]
End: [3, 2]

But this feels wrong. Is there an existing operator in RXPY that would accomplish this same functionality?

John Ericksen

1 Answer

Even in your second solution, you don't ensure that the next tasks won't start before the tasks from the first batch are finished.

To test this, simply change end_task to:

from time import sleep

def end_task(value):
    sleep(value[1])
    print(f"End: {value}")

and the output becomes:

Start main
Started [0, 1]
Started [0, 2]
End: [0, 1]
Started [3, 1]
End: [0, 2]
Started [3, 2]
Started [5, 1]

This happens because the sequence whose task ends first continues and sets working back to False, even though [0, 2] is still being processed.
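
As an aside, the same race can be reproduced without RxPY at all. Here is a minimal sketch (the worker names and timings are illustrative, not from the question) of two threads sharing a single boolean flag: whichever finishes first clears the flag while the other is still running.

from threading import Thread
from time import sleep

working = True

def worker(name, duration):
    global working
    sleep(duration)
    # Whichever worker finishes first clears the shared flag,
    # even though the other worker may still be running
    working = False
    print(f"{name} done, working={working}")

fast = Thread(target=worker, args=("fast", 1))
slow = Thread(target=worker, args=("slow", 2))
fast.start()
slow.start()

sleep(1.5)
# "fast" has already cleared the flag, but "slow" is still running
print(f"working={working} while slow is still running")

fast.join()
slow.join()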

Solution using threading Lock

If you want, you can use Lock from the threading package to achieve this in a more Pythonic way.

Now at the beginning of your sequence you should:

  • filter on whether the lock is currently locked. (You don't really need to do this part, but since the sequence starts every second and runs for at least 2 seconds, you would otherwise have a lot of threads queued up waiting to acquire the lock, so filtering them out is just better.)
  • acquire the lock
  • group all the elements of the sequence, to ensure that the lock is released only once all the tasks have ended
  • release the lock

Here is my modification of the code:

import rx
from rx import operators as ops
from threading import Lock
from time import sleep

def start_task(value):
    print(f"Started {value}")
    return value

def end_task(value):
    print(f"End: {value}")
    sleep(value[1])
    # Return the interval value
    return value[0]

def main():
    # Initialise the lock
    lock = Lock()
    print("Start main")

    rx.interval(1).pipe(
        # Check if the lock is currently locked
        ops.filter(lambda _: not lock.locked()),
        # Acquire the lock
        ops.do_action(lambda _: lock.acquire()),
        ops.flat_map(lambda time: rx.from_([1, 2]).pipe(
            ops.map(lambda value: [time, value])
        )),
        ops.map(lambda value: start_task(value)),
        ops.delay(2),
        ops.map(lambda value: end_task(value)),
        # Group by interval value to ensure the lock is released
        # when all tasks of the batch have ended
        ops.group_by(lambda value: value),
        # Release the lock at the end
        ops.do_action(lambda _: lock.release()),
    ).run()

main()

puchal
  • I like the usage of the threading lock... but I'm not sure the group_by does much here. Maybe it should be grouped until a certain token count is reached? – John Ericksen Mar 09 '23 at 15:58
  • Ah, I was able to accomplish this using a buffer_with_count instead of the group_by. – John Ericksen Mar 09 '23 at 16:57
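
For completeness, here is a sketch of the buffer_with_count variant mentioned in the last comment. It is an adaptation of the answer's code, not the commenter's actual solution, and it assumes each batch contains exactly two tasks (the count passed to buffer_with_count must match the batch size):

import rx
from rx import operators as ops
from threading import Lock
from time import sleep

def start_task(value):
    print(f"Started {value}")
    return value

def end_task(value):
    print(f"End: {value}")
    sleep(value[1])
    # Return the interval value
    return value[0]

def main():
    lock = Lock()
    print("Start main")

    rx.interval(1).pipe(
        ops.filter(lambda _: not lock.locked()),
        ops.do_action(lambda _: lock.acquire()),
        ops.flat_map(lambda time: rx.from_([1, 2]).pipe(
            ops.map(lambda value: [time, value])
        )),
        ops.map(lambda value: start_task(value)),
        ops.delay(2),
        ops.map(lambda value: end_task(value)),
        # Collect results until the whole batch (2 tasks) has ended,
        # then emit them as a single list
        ops.buffer_with_count(2),
        # Release the lock once per completed batch
        ops.do_action(lambda _: lock.release()),
    ).run()

main()

buffer_with_count(2) does not emit until both results of a batch have passed through, so the release in do_action happens exactly once per batch, after the whole batch has ended.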