I am using the multiprocessing module to retrieve URLs in parallel. My code looks like this:
import multiprocessing
import re
import urllib2

pat = re.compile(r"(?P<url>https?://[^\s]+)")

def resolve_url(text):
    missing = 0
    bad = 0
    url = 'before'
    long_url = 'after'
    match = pat.search(text)  ## a text looks like "I am at here. http:.....(a URL)"
    if not match:
        missing = 1
    else:
        url = match.group("url")
        try:
            long_url = urllib2.urlopen(url).url
        except Exception:
            bad = 1
    return (url, long_url, missing, bad)

if __name__ == '__main__':
    pool = multiprocessing.Pool(100)
    resolved_urls = pool.map(resolve_url, checkin5)  ## checkin5 is a list of texts
The issue is that my checkin5 list contains around 600,000 elements, and this parallel work takes a long time. While it is running, I want to check how many elements have been resolved so far. In a simple for loop, I could do it like this:
resolved_urls = []
now = time.time()
for i, element in enumerate(checkin5):
    resolved_urls.append(resolve_url(element))
    if i%1000 == 0:
        print("from %d to %d: %2.5f seconds" %(i-1000, i, time.time()-now))
        now = time.time()
But now I need it to run faster, so multiprocessing is necessary, and I don't know how to monitor progress in that case. Any ideas?
By the way, to check whether the loop-based approach above also works with a pool, I tried a toy example:
import multiprocessing
import time

def cal(x):
    res = x*x
    return res

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)

    t0 = time.time()
    result_list = pool.map(cal,range(1000000))
    print(time.time()-t0)

    t0 = time.time()
    for i, result in enumerate(pool.map(cal, range(1000000))):
        if i%100000 == 0:
            print("%d elements have been calculated, %2.5f" %(i, time.time()-t0))
            t0 = time.time()
And the results are:
0.465271949768
0 elements have been calculated, 0.45459
100000 elements have been calculated, 0.02211
200000 elements have been calculated, 0.02142
300000 elements have been calculated, 0.02118
400000 elements have been calculated, 0.01068
500000 elements have been calculated, 0.01038
600000 elements have been calculated, 0.01391
700000 elements have been calculated, 0.01174
800000 elements have been calculated, 0.01098
900000 elements have been calculated, 0.01319
From these results, I think the single-process method doesn't work here: the first report takes about as long as the whole pool.map call, and the remaining ones are almost instant. It seems that pool.map is called first, and only after the calculation has finished and the complete list has been obtained does enumerate begin. Am I right?
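
To make that guess concrete, here is a minimal sketch in Python 3 syntax. The helper describe_map_result and its pool_cls parameter are names made up for illustration, not part of multiprocessing. It only inspects what pool.map hands back: if the return value is already a complete list of the expected length, then every task finished before enumerate saw the first element.

```python
import multiprocessing.pool  # also binds the top-level `multiprocessing` name


def cal(x):
    # Same toy function as in the example above.
    return x * x


def describe_map_result(pool_cls=multiprocessing.Pool, n=1000, workers=2):
    """Hypothetical helper: report the type and length of pool.map's result.

    pool_cls is assumed to be multiprocessing.Pool or a class with the same
    interface, such as multiprocessing.pool.ThreadPool.
    """
    with pool_cls(workers) as pool:
        result = pool.map(cal, range(n))
    # By the time map() returns, the whole result already exists.
    return type(result).__name__, len(result)


if __name__ == '__main__':
    print(describe_map_result())  # prints ('list', 1000)
```

If this prints a list with all n results, it supports the reading above: the loop over enumerate only walks a list that is already finished, so its timings say nothing about the progress of the workers.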