
I am using the multiprocessing module to do parallel URL retrieval. My code is like:

import re
import urllib2
import multiprocessing

pat = re.compile(r"(?P<url>https?://[^\s]+)")

def resolve_url(text):
    # a text looks like "I am at here. http:.....(a URL)"
    missing = 0
    bad = 0
    url = 'before'
    long_url = 'after'
    match = pat.search(text)
    if not match:
        missing = 1
    else:
        url = match.group("url")
        try:
            long_url = urllib2.urlopen(url).url  # final URL after redirects
        except:
            bad = 1
    return (url, long_url, missing, bad)

if __name__ == '__main__':
    pool = multiprocessing.Pool(100)
    resolved_urls = pool.map(resolve_url, checkin5)  ## checkin5 is a list of texts

The issue is, my checkin5 list contains around 600,000 elements, and this parallel job really takes time. I want to check, while it runs, how many elements have already been resolved. With a simple for loop I can do this like:

import time

resolved_urls = []
now = time.time()
for i, element in enumerate(checkin5):
    resolved_urls.append(resolve_url(element))
    if i % 1000 == 0 and i > 0:
        print("from %d to %d: %2.5f seconds" % (i - 1000, i, time.time() - now))
        now = time.time()

But now I need to increase efficiency, so multiprocessing is necessary, and I don't know how to track progress in that case. Any ideas?

By the way, to check whether the above method also works in this case, I tried some toy code:

import multiprocessing
import time

def cal(x):
    res = x*x
    return res

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)

    t0 = time.time()
    result_list = pool.map(cal, range(1000000))
    print(time.time()-t0)

    t0 = time.time()
    for i, result in enumerate(pool.map(cal, range(1000000))):
        if i%100000 == 0:
            print("%d elements have been calculated, %2.5f" %(i, time.time()-t0))
            t0 = time.time()

And the results are:

0.465271949768
0 elements have been calculated, 0.45459
100000 elements have been calculated, 0.02211
200000 elements have been calculated, 0.02142
300000 elements have been calculated, 0.02118
400000 elements have been calculated, 0.01068
500000 elements have been calculated, 0.01038
600000 elements have been calculated, 0.01391
700000 elements have been calculated, 0.01174
800000 elements have been calculated, 0.01098
900000 elements have been calculated, 0.01319

From the results, I think the single-process method doesn't work here. It seems that pool.map runs to completion first, and only after the calculation is finished and the complete list is obtained does the enumerate begin... Am I right?

gladys0313

2 Answers


You should be able to do this with either Pool.imap or Pool.imap_unordered, depending on whether or not you care about the ordering of the results. Both return an iterator immediately and yield results as they become available...

import multiprocessing

resolved_urls = []
pool = multiprocessing.Pool(100)
res = pool.imap(resolve_url, checkin5)  # returns an iterator immediately

for x in res:
    resolved_urls.append(x)
    print('finished one')
    # ... whatever counting/tracking code you want here
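
If ordering doesn't matter, imap_unordered yields whichever result finishes first. A minimal sketch (my addition, reusing resolve_url and checkin5 from the question) that reports progress every 1000 results, like the question's single-process loop:

import multiprocessing
import time

if __name__ == '__main__':
    pool = multiprocessing.Pool(100)
    resolved_urls = []
    now = time.time()
    # results arrive in completion order, so one slow URL
    # does not hold up the progress report
    for i, result in enumerate(pool.imap_unordered(resolve_url, checkin5)):
        resolved_urls.append(result)
        if i % 1000 == 0 and i > 0:
            print("%d resolved, last 1000 in %2.5f seconds" % (i, time.time() - now))
            now = time.time()
    pool.close()
    pool.join()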
danf1024
  • Hi, thanks, YakymPirozhenko also suggested this, but in this case the whole process takes longer... in my example, calculating the sqrt for 100000 elements, the execution time increases from 0.35s (pool.map) to ~13s (pool.imap)... hmm... – gladys0313 Apr 28 '16 at 20:38
  • This is not really surprising as for a simple function like squaring, the amount of work that goes into resolving race conditions outweighs the benefit of parallelization. My guess is that with HTTP request processing `imap` will be almost as fast as `map`. I will post more details below. – hilberts_drinking_problem Apr 28 '16 at 20:55

First, I believe that @danf1024 has the answer. What follows addresses the slowdown observed when switching from pool.map to pool.imap.

Here is a little experiment:

from multiprocessing import Pool


def square(x):
    return x * x


N = 10 ** 4
l = list(range(N))


def test_map(n=N):
    list(Pool().map(square, l))

# In [3]: %timeit -n10 q.test_map()
# 10 loops, best of 3: 14.2 ms per loop


def test_imap(n=N):
    list(Pool().imap(square, l))

# In [4]: %timeit -n10 q.test_imap()
# 10 loops, best of 3: 232 ms per loop


def test_imap1(n=N):
    list(Pool(processes=1).imap(square, l))

# In [5]: %timeit -n10 q.test_imap1()
# 10 loops, best of 3: 191 ms per loop


def test_map_naive(n=N):
    # cast map to list in python3
    list(map(square, l))

# In [6]: %timeit -n10 q.test_map_naive()
# 10 loops, best of 3: 1.2 ms per loop

Because squaring is a cheap operation compared to, say, downloading and parsing a web page, parallelization only pays off if each processor can work through large uninterrupted chunks of input. That is not the case with imap, which by default hands items to workers one at a time and so performs very poorly on my 4 cores. Amusingly, restricting the pool to 1 process makes imap go faster, because the contention between processes is removed.
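
One mitigation worth adding (my note, not part of the original experiment): imap accepts an optional chunksize argument, which batches the input so each dispatch to a worker carries many items instead of one, amortizing that per-item overhead. A sketch in the same style as the experiment above, with the measurement left to you:

def test_imap_chunked(n=N):
    # each worker receives 1000 items per dispatch instead of 1,
    # so the inter-process overhead is paid once per chunk
    list(Pool().imap(square, l, chunksize=1000))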

However, when you move to more costly operations, the difference between imap and map becomes less and less significant.
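
For intuition, here is a minimal sketch (my construction; the 10 ms sleep is an arbitrary stand-in for network latency) that simulates a costly per-item operation:

from multiprocessing import Pool
import time

def slow_square(x):
    time.sleep(0.01)  # stand-in for network I/O
    return x * x

def test_map_slow():
    list(Pool(4).map(slow_square, range(200)))

def test_imap_slow():
    # with ~0.5 s of sleep per worker dominating, imap's per-item
    # dispatch overhead should be negligible next to map's
    list(Pool(4).imap(slow_square, range(200)))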

hilberts_drinking_problem
  • wow, amazing, thank you very much! I accepted dan's answer because it is right for this question; I think it will be more straightforward and useful for other people who share my problem. But your explanation and help are super for me. Thank you! – gladys0313 Apr 28 '16 at 21:16
  • I am glad you find it helpful. Also, given your task, you may want to check out the eventlet library: http://eventlet.net/ – hilberts_drinking_problem Apr 28 '16 at 21:42