I have a list of strings and I want to process the strings in a periodic manner.

A new string should start processing every second, and each string takes 3 seconds to process.

What I expect to observe is that from the 3rd second on, I will see a new result every second until all the strings are processed.

However, what I actually saw was that all the results showed up together, only after all of them had been generated. So the question is: how can I modify the code to get the behavior I expect?

from twisted.internet import reactor, threads
import json
import time


def process(string):
    print "Processing " + string + "\n"
    time.sleep(3)  # simulate computation time

    # write result to file; result is mocked by string*3
    file_name = string + ".txt"
    with open(file_name, "w") as fp:
        json.dump(string*3, fp)

    print string + " processed\n"

string_list = ["AAAA", "BBBB", "CCCC", "XXXX", "YYYY", "ZZZZ"]

for s in string_list:
    # start a new thread every second
    time.sleep(1)
    threads.deferToThread(process, s)

reactor.run()

Meanwhile, it looks like the order in which the results are generated isn't consistent with the order in which the strings are processed. My guess is that they are just printed out of order but are actually processed in order. How can I verify that?

Another trivial thing I noticed is that Processing YYYY is not printed in the right place. Why is that? (There should be an empty line between it and the previous result.)

Processing AAAA

Processing BBBB

Processing CCCC

Processing XXXX
Processing YYYY


Processing ZZZZ

YYYY processed

CCCC processed

AAAA processed

BBBB processed

XXXX processed

ZZZZ processed
s9527

1 Answer

What this part of your code does:

for s in string_list:
    # start a new thread every second
    time.sleep(1)
    threads.deferToThread(process, s)

reactor.run()

is schedule each chunk of work with a delay of one second between each scheduling operation. Then, finally, it starts the reactor which allows processing to begin. There is no processing until reactor.run().

The use of time.sleep(1) also means your delays are blocking and this will be a problem once you solve the above.

One solution is to replace the for loop and the time.sleep(1) with a LoopingCall.

from twisted.internet.task import LoopingCall, react
from twisted.internet.threads import deferToThread

string_list = [...]
def process(string):
    ...

def process_strings(the_strings, f):
    def dispatch(s):
        d = deferToThread(f, s)
        # Add callback / errback to d here to process the
        # result or report any problems.
        # Do _not_ return `d` though.  LoopingCall will
        # wait on it before running the next iteration if
        # we do.

    string_iter = iter(the_strings)
    c = LoopingCall(lambda: dispatch(next(string_iter)))
    d = c.start(1)
    d.addErrback(lambda err: err.trap(StopIteration))
    return d

def main(reactor):
    return process_strings(string_list, process)

react(main, [])

This code uses react to start and stop the reactor (it stops when the Deferred returned by main fires). It uses LoopingCall started with a period of 1 to run f(next(string_iter)) in the threadpool until StopIteration (or some other error) is encountered.

(LoopingCall and deferToThread both take *args and **kwargs to pass on to their callable so if you prefer (it's a matter of style), you can also write that expression as LoopingCall(lambda: deferToThread(f, next(string_iter))). Note, though, that this lambda returns the Deferred, so LoopingCall would wait for each job to finish before starting the next one, which is exactly what dispatch avoids. You cannot "unwrap" the remaining lambda because that would result in LoopingCall(deferToThread, f, next(string_iter)), which only evaluates next(string_iter) once, at the time LoopingCall is called, so you would end up processing the first string forever.)
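The point about next(string_iter) being evaluated only once is plain Python argument evaluation and can be seen without Twisted. In this small sketch, `repeat_call` is a hypothetical stand-in for LoopingCall that simply calls its function three times; it is not part of any library.

```python
def repeat_call(fn, *args, **kwargs):
    """Hypothetical stand-in for LoopingCall: call fn three times."""
    return [fn(*args, **kwargs) for _ in range(3)]


strings = iter(["AAAA", "BBBB", "CCCC"])
# next(strings) runs once, right here, before repeat_call is entered,
# so the same value is passed on every call.
eager = repeat_call(str.lower, next(strings))

strings = iter(["AAAA", "BBBB", "CCCC"])
# The lambda postpones next(strings) until each individual call.
lazy = repeat_call(lambda: str.lower(next(strings)))

print(eager)  # ['aaaa', 'aaaa', 'aaaa']
print(lazy)   # ['aaaa', 'bbbb', 'cccc']
```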

There are other possible approaches to scheduling as well. For example, you could use cooperate to run exactly 3 processing threads at a time - starting a new one as soon as an older one completes.

from twisted.internet.defer import gatherResults
from twisted.internet.task import cooperate
from twisted.internet.threads import deferToThread

def process_strings(the_strings, f):
    # Define a generator of all of the jobs to be accomplished.
    work_iter = (
        # Pass the string as an argument; a lambda would look up
        # a_string late and could see a later value.
        deferToThread(f, a_string)
        for a_string
        in the_strings
    )
    # Consume jobs from the generator in parallel until done.
    tasks = list(cooperate(work_iter) for i in range(3))

    # Return a Deferred that fires when all three tasks have
    # finished consuming all available jobs.
    return gatherResults(list(
        t.whenDone()
        for t
        in tasks
    ))
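One caveat when jobs are built lazily in a generator expression like work_iter: a lambda created there looks up the loop variable only when it is finally called, whereas an argument handed straight to deferToThread is bound immediately. A small pure-Python sketch of the difference (`make_late_jobs` and `make_bound_jobs` are illustrative names, not Twisted APIs):

```python
def make_late_jobs(strings):
    # Every lambda closes over the generator's single variable `s`.
    return (lambda: s for s in strings)


def make_bound_jobs(strings):
    # A default argument captures the current value when the lambda is made.
    return (lambda s=s: s for s in strings)


late_jobs = make_late_jobs(["AAAA", "BBBB"])
first_late = next(late_jobs)
next(late_jobs)               # advancing the generator rebinds s
late_result = first_late()

bound_jobs = make_bound_jobs(["AAAA", "BBBB"])
first_bound = next(bound_jobs)
next(bound_jobs)
bound_result = first_bound()

print(late_result)   # BBBB
print(bound_result)  # AAAA
```

In a threaded setting the late lookup becomes a race: whether the job sees its own string depends on whether the thread runs before the generator is advanced again.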

In both cases, notice there's no use of time.sleep.

Jean-Paul Calderone
  • Can you please format the code that uses `cooperate`? – s9527 Jul 02 '17 at 17:49
  • In your first solution, why is the function passed to `LoopingCall` `lambda: deferToThread()` instead of `deferToThread`? – s9527 Jul 03 '17 at 09:50
  • The effect of running the first solution is that a string cannot start being processed until the previous one is processed. What I actually want is that a task can start even if there are unfinished tasks. (I haven't looked into the code of the second solution but by looking at the description it isn't what I want to see, since it claims "starting a new one as soon as an older one completes".) – s9527 Jul 03 '17 at 09:50
  • deferToThread schedules into a threadpool. If you want a very large number of tasks to be able to run concurrently, you may need to make the threadpool larger. See IReactorThreads. – Jean-Paul Calderone Jul 03 '17 at 11:13
  • In my case, if I start a new task every second and each task takes 3 seconds to process, then there will be only 3 concurrent tasks. I suggested a thread pool size of 10 so there won't be problems with the thread pool overflowing. However, task B still starts only after task A is finished. I thought that was because the thread pool size was only 1 (it could also be something else, this is just a guess), but after explicitly setting it to 10 the problem still isn't solved. – s9527 Jul 03 '17 at 17:47
  • What I want to see is this sequence: `Processing A` (0 sec), `Processing B` (1 sec), `Processing C` (2 sec), `Processing X`/`A processed` (3 sec, A finishes at the same time when X starts). How to change the code to do that? – s9527 Jul 03 '17 at 17:47
  • What does your actual processing function do? Have you tried the LoopingCall solution with a dummy function that just sleeps? – Jean-Paul Calderone Jul 03 '17 at 18:08
  • Ah, nevermind that last question. I see my mistake. Editing answer. – Jean-Paul Calderone Jul 03 '17 at 18:14