
I am working on an aggregation platform. We want to store resized versions of 'aggregated' images from the web on our servers. Specifically, these are images of e-commerce products from different vendors. Each 'item' dictionary has an "image" field, which is a URL that needs to be downloaded, compressed, and saved to disk.

download and compression method:

# Imports used across the three snippets below (they all live in one script)
import hashlib
import json
import multiprocessing
import os
import sys
import traceback
import urllib2
import cStringIO
from PIL import Image

def downloadCompressImage(url, width, item):
    #Retrieve our source image from a URL

    #Load the URL data into an image
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = opener.open(url)
    img = cStringIO.StringIO(response.read())
    im = Image.open(img)

    #Scale the height to keep the aspect ratio at the requested width
    wpercent = (width / float(im.size[0]))
    hsize = int((float(im.size[1]) * float(wpercent)))

    #Resize the image
    im2 = im.resize((width, hsize), Image.ANTIALIAS)

    key_name = item["vendor"] + "_" + hashlib.md5(url.encode('utf-8')).hexdigest() + "_" + str(width) + "x" + str(hsize) + ".jpg"

    # 'timestamp' is assumed to be a module-level string set once at startup
    path = "/var/www/html/server/images/"
    path = path + timestamp + "/"

    #save compressed image to disk and return its public URL
    im2.save(path + key_name, 'JPEG', quality=85)
    url = "http://server.com/images/" + timestamp + "/" + key_name
    return url

worker method:

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = []
    for line in lines:
        line = line.rstrip('\n')
        item = json.loads(line.decode('ascii', 'ignore'))

        #
        #Do stuff with the item dict and update it
        #

        # Append item to result if image dl and compression is successful
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
        except:
            print "dl-comp exception in processing: " + item['name'] + item['vendor']
            traceback.print_exc(file=sys.stdout)
            continue

        if item["grid_image"] != -1:
            result.append(item)

    return result

main method:

if __name__ == '__main__':
    # configurable options.  different values may work better.
    numthreads = 15
    numlines = 1000

    lines = open('parserProducts.json').readlines()

    # output file for the JSON array; the name here is assumed -- the original
    # script opens it (and writes the opening '[') before the pool loop
    f = open('parsedProducts_out.json', 'w')
    f.write('[')

    # create the process pool
    pool = multiprocessing.Pool(processes=numthreads)

    for result_lines in pool.imap(worker, (lines[line:line + numlines] for line in xrange(0, len(lines), numlines))):
        for line in result_lines:
            jdata = json.dumps(line)
            f.write(jdata + ',\n')

    pool.close()
    pool.join()

    # drop the trailing ",\n" and close the JSON array
    f.seek(-2, os.SEEK_END)
    f.truncate()
    f.write(']')
    f.close()

    print "parsing is done"

My question: is this the best I can do with Python? There are ~3 M dictionary items. Without the call to "downloadCompressImage" in 'worker', the "#Do stuff with the item dict and update it" portion takes only 8 minutes to complete. With the download and compression, though, it looks like it would take weeks, if not months.
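
For a rough sense of scale, here is a back-of-envelope estimate (the ~1 s per image is only an assumed average, not a measurement):

# Back-of-envelope only: the per-image time below is an assumption, not a measurement.
images = 3000000
seconds_per_image = 1.0      # assumed average for download + resize + save
workers = 15

serial_days = images * seconds_per_image / 86400.0    # ~34.7 days end to end
pooled_days = serial_days / workers                   # ~2.3 days if the pool scales perfectly
print "serial: %.1f days, with %d workers: %.1f days" % (serial_days, workers, pooled_days)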

Any ideas appreciated, thanks a bunch.

chetfaker

1 Answer


You are working with 3 million images here, each downloaded from the internet and then compressed. How much time that takes depends on two things, as far as I can tell:

  1. Your network speed (and the speed of the target server), to download the images.
  2. Your CPU power, to compress the images.

So it is not Python limiting you; you are doing fine with multiprocessing.Pool. The main bottlenecks are your network speed (and the speed of the target servers) and the number of cores (or CPU power) you have.
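
To see which of the two dominates in your setup, you can time the fetch and the resize/encode separately on a small sample. A rough sketch ('sample_urls' is assumed to be a list of a few hundred of your image URLs):

# Rough profiling sketch: time the network fetch and the resize/encode
# separately, to see whether the network or the CPU dominates.
import time
import urllib2
import cStringIO
from PIL import Image

def profile_one(url, width=200):
    t0 = time.time()
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    data = opener.open(url).read()
    t_download = time.time() - t0

    t1 = time.time()
    im = Image.open(cStringIO.StringIO(data))
    hsize = int(float(im.size[1]) * (width / float(im.size[0])))
    out = cStringIO.StringIO()    # encode in memory, leave the disk out of it
    im.resize((width, hsize), Image.ANTIALIAS).save(out, 'JPEG', quality=85)
    t_compress = time.time() - t1
    return t_download, t_compress

if __name__ == '__main__':
    sample_urls = []    # assumed: fill with a few hundred "image" URLs from your feed
    times = [profile_one(u) for u in sample_urls]
    if times:
        print "avg download %.2f s, avg compress %.2f s" % (
            sum(t[0] for t in times) / len(times),
            sum(t[1] for t in times) / len(times))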

Muhammad Tahir
  • The machine has 8 cores. With 15 threads, the speed I achieve without the call to 'downloadCompressImage' is satisfactory. If I am writing files in parallel to the same disk location, does that hinder performance at the OS level? Would writing to different disk locations (say, a separate location for each thread) speed things up? Thanks! – chetfaker Apr 13 '16 at 09:27
  • Even with 8 cores, there are 3 million images, so the processing will take some time. As for disk speed, writing to different locations might increase performance a little, but I am sure your network connection is way slower than your disk, so disk speed shouldn't matter here. But if you think your network is faster than your disk, then you need a faster disk. – Muhammad Tahir Apr 13 '16 at 10:13
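
On the "separate location for each thread" question from the comments, one way to test it is to give each pool process its own subdirectory. A minimal sketch (it reuses the base path from the question; worker_output_dir is a hypothetical helper):

# Minimal sketch of the "separate location per worker" idea: each pool
# process writes under its own subdirectory, so contention on a single
# directory can be ruled in or out by comparing throughput.
import os
import multiprocessing

BASE_PATH = "/var/www/html/server/images/"    # same base path as in the question

def worker_output_dir():
    # Pool processes get names like "PoolWorker-3", which makes a handy suffix.
    name = multiprocessing.current_process().name
    path = os.path.join(BASE_PATH, name)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# Inside downloadCompressImage, the save call would then become e.g.:
#   im2.save(os.path.join(worker_output_dir(), key_name), 'JPEG', quality=85)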