I am working on an aggregation platform. We want to store resized versions of 'aggregated' images from the web on our servers. Specifically, these are images of e-commerce products from different vendors. Each 'item' dictionary has an "image" field, which is a URL that needs to be downloaded, compressed, and saved to disk.
download and compression method:
import hashlib
import cStringIO
import urllib2
from PIL import Image

def downloadCompressImage(url, width, item):
    """Download the source image from `url`, resize it to `width`,
    save it as a JPEG, and return its public URL."""
    # Fetch the URL with a browser-like User-agent (some vendors block the default)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = opener.open(url)
    img = cStringIO.StringIO(response.read())
    im = Image.open(img)
    # Scale the height proportionally to the requested width
    wpercent = width / float(im.size[0])
    hsize = int(float(im.size[1]) * wpercent)
    # Resize the image
    im2 = im.resize((width, hsize), Image.ANTIALIAS)
    # JPEG has no alpha channel, so normalize the mode before saving
    if im2.mode != 'RGB':
        im2 = im2.convert('RGB')
    key_name = (item["vendor"] + "_" + hashlib.md5(url.encode('utf-8')).hexdigest() +
                "_" + str(width) + "x" + str(hsize) + ".jpg")
    # `timestamp` is a module-level global set once at startup
    path = "/var/www/html/server/images/" + timestamp + "/"
    # Save the compressed image to disk
    im2.save(path + key_name, 'JPEG', quality=85)
    return "http://server.com/images/" + timestamp + "/" + key_name
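For reference, a standalone call looks like this (the vendor name, URL, and `timestamp` value here are made up for illustration):

timestamp = "2014-05-10"  # normally set once at startup
item = {"vendor": "acme", "name": "widget", "image": "http://example.com/widget.jpg"}
print downloadCompressImage(item["image"], 200, item)
# -> http://server.com/images/2014-05-10/acme_<md5 of url>_200x<h>.jpg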
worker method:
import json
import sys
import traceback

def worker(lines):
    """Make a list of item dicts out of the parsed, supplied lines."""
    result = []
    for line in lines:
        line = line.rstrip('\n')
        item = json.loads(line.decode('ascii', 'ignore'))
        #
        # Do stuff with the item dict and update it
        #
        # Keep the item only if download and compression succeed;
        # downloadCompressImage signals failure by raising, so the
        # except/continue below already skips bad items.
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
        except Exception:
            print "dl-comp exception in processing: " + item['name'] + " " + item['vendor']
            traceback.print_exc(file=sys.stdout)
            continue
        result.append(item)
    return result
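Each line of parserProducts.json is one JSON object; a minimal example of what worker expects (real items carry more fields than these) would be:

{"vendor": "acme", "name": "widget", "image": "http://example.com/widget.jpg"}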
main method:
import multiprocessing
import os

if __name__ == '__main__':
    # configurable options; different values may work better
    numthreads = 15   # actually a process count: Pool uses processes, not threads
    numlines = 1000

    lines = open('parserProducts.json').readlines()
    # output file (the open() call was missing from the snippet; name is illustrative)
    f = open('compressedProducts.json', 'w')
    f.write('[')

    # create the process pool and feed it chunks of `numlines` lines
    pool = multiprocessing.Pool(processes=numthreads)
    for result_lines in pool.imap(worker, (lines[i:i + numlines] for i in xrange(0, len(lines), numlines))):
        for item in result_lines:
            f.write(json.dumps(item) + ',\n')
    pool.close()
    pool.join()

    # replace the trailing ',\n' with the closing bracket of the JSON array
    f.seek(-2, os.SEEK_END)
    f.truncate()
    f.write(']')
    f.close()
    print "parsing is done"
My question: is this the best I can do with Python? There are ~3 million dictionary items. Without calling the downloadCompressImage method in worker, the "#Do stuff with the item dict and update it" portion takes only 8 minutes to complete. With the download and compression, though, it looks like it would take weeks, if not months.
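Back-of-the-envelope: 3 million images across 15 worker processes is ~200,000 images per process; even assuming only 1 second per download + resize, that is ~55 hours, and at 5 seconds per image it is closer to 12 days, which matches what I am seeing.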
Any ideas appreciated, thanks a bunch.