
I'm trying to get some values (which I obtain with an `extract` function) from URLs stored in data.file; there are about 3,000,000 URL links in the file. Here is my code snippet:

from multiprocessing import Pool
p = Pool(10)
revenuelist = p.map(extract, data.file)

But the problem is that if the internet connection drops, this code has to run again from the start. How do I add fault tolerance to my code (i.e. store intermediate results so the same work is not repeated)?

gaurav1207

1 Answer


A very simple solution is to use a file to store your current position, and `try...finally` to handle failures:

from multiprocessing import Pool

# read the last saved position (0 if the checkpoint file is missing or empty)
try:
    with open(FILENAME) as f:
        current = int(f.read() or 0)
except FileNotFoundError:
    current = 0

if current:
    skip_lines(current)  # advance the input past the URLs already processed

try:
    with Pool() as pool:
        results = pool.imap(extract, data.file)
        for result in results:
            do_something(result)
            current += 1
finally:
    # persist progress so the next run resumes where this one stopped
    with open(FILENAME, "w") as f:
        f.write(str(current))

See also `concurrent.futures` (much cooler than `multiprocessing.Pool`).
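For illustration, here is a minimal sketch of the same resume-on-restart idea with `concurrent.futures`; `urls` stands for the OP's list of URL strings, `DONE_FILE` is a hypothetical checkpoint file that records each finished URL, and `extract`/`do_something` are the functions used above:

from concurrent.futures import ProcessPoolExecutor, as_completed

# URLs already processed in a previous run
try:
    with open(DONE_FILE) as f:
        done = set(line.strip() for line in f)
except FileNotFoundError:
    done = set()

pending = [url for url in urls if url not in done]

with open(DONE_FILE, "a") as log, ProcessPoolExecutor() as executor:
    futures = {executor.submit(extract, url): url for url in pending}
    for future in as_completed(futures):
        url = futures[future]
        try:
            do_something(future.result())
        except Exception:
            continue  # failed URL is not logged, so it is retried on the next run
        log.write(url + "\n")  # checkpoint each success immediately

Because each URL is checkpointed individually, completion order does not matter and a dropped connection only costs the URLs that were in flight.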

A better solution would be to use a database to fully track your progress, and/or a proper task queue (for example, Celery) to execute your jobs.
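As a rough Celery sketch (the broker/backend URLs are assumptions for a local Redis; `extract` is the OP's function and must be importable by the workers):

from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=10)
def extract_task(self, url):
    try:
        return extract(url)  # the OP's extract() function
    except Exception as exc:
        # retry only this URL after 10 seconds instead of restarting the whole batch
        raise self.retry(exc=exc)

Each URL becomes an independent task, so Celery tracks per-task state in the result backend and only failed tasks are retried; the URLs can be enqueued once with `extract_task.delay(url)`.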

Udi
  • This is kinda ugly, and that global state kills the benefits of `map`. The OP can easily transform `extract` into a total function with an optional return value and use a recursive `map/filter/reduce` strategy to separate successful and failed calls to repeat calculations for the failed (empty) cases (see the sketch after these comments). – Eli Korvigo Mar 12 '17 at 19:54
  • I agree this is ugly :-) Using a task queue and a database is much neater - however, if this is some kind of batch process he runs only once, and just needs a "restart where failed" feature, this should be good enough :-) – Udi Mar 12 '17 at 20:19
  • What do you mean by *"global state kills the benefits of map"*? The sole purpose of the map here is to use n processes instead of 1 (for 3000000 jobs), and `imap` should be good enough here. – Udi Mar 12 '17 at 20:21
  • `imap` is usually several times slower than `map`. If the target function is relatively light, this might affect performance significantly. – Eli Korvigo Mar 12 '17 at 20:36
  • Well, the question above clearly does not need to use `map()`, since it considers tasks already run as done :-) – Udi Mar 12 '17 at 20:42
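
A rough sketch of the total-function approach described in the comments above (all helper names are illustrative; `extract` is the OP's function, and `None` is used to mark a failed call):

from multiprocessing import Pool

def safe_extract(url):
    # total function: never raises, returns None on failure
    try:
        return extract(url)
    except Exception:
        return None

def extract_all(urls, pool, retries=3):
    results = dict(zip(urls, pool.map(safe_extract, urls)))
    failed = [url for url, value in results.items() if value is None]
    if failed and retries > 0:
        # re-run only the failed subset
        results.update(extract_all(failed, pool, retries - 1))
    return results

with Pool(10) as pool:
    revenuelist = extract_all([line.strip() for line in data.file], pool)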