
I am facing an error with this code. Can anyone help me fix it so I can automate downloading all the images whose URLs are listed in a CSV file?

The error I am getting is:

    URLError                                  Traceback (most recent call last)
    <ipython-input-320-dcd87f841181> in <module>
         19         urlShort = re.search(filejpg, str(r)).group()
         20         print(urlShort)
    ---> 21         download(x, f'{di}/{urlShort}')
         22         print(type(x))

    URLError: <urlopen error unknown url type: {'https>

This is the code I am using:

from pathlib import Path
from shutil import rmtree as delete
from urllib.request import urlretrieve as download
from gazpacho import get, Soup
import re
import pandas as pd
import numpy as np


#import data
df = pd.read_csv('urlReady1.csv')
df.shape
#locate folder
di = 'Dubai'
Path(di).mkdir(exist_ok=True)

#change data to dict
dict_copy = df.to_dict('records')

#iterate over every row of the data and download the jpg file
for r in dict_copy:
    if r == 'urlready':
        print("header")
    else:
        x = str(r)
        filejpg = "[\d]{1,}\.jpg"
        urlShort = re.search(filejpg, str(r)).group()
        print(urlShort)
        download(x, f'{di}/{urlShort}')
        print(type(x))
  • Could you add an example of the format your `csv` file is in? Also, please provide the full stack trace (it looks like part of it was cut off). – Ayush Garg Nov 24 '20 at 16:20

1 Answer


I can't see your data set, but I think pandas' to_dict('records') is returning a list of dicts (which you are storing as dict_copy). So when you iterate through it with for r in dict_copy:, r isn't a URL but a dict that contains the URL somewhere. str(r) then converts that dict {<stuff>} into the string '{<stuff>}', and that is what you end up passing as your URL.

I think that's why you are seeing the error URLError: <urlopen error unknown url type: {'https>
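To illustrate with a made-up record (an example URL, not your real data), stringifying one element of to_dict('records') hands urllib the dict syntax, not a URL:

from urllib.request import urlretrieve
from urllib.error import URLError

# hypothetical record shaped like one element of df.to_dict('records')
r = {'urlReady': 'https://example.com/images/43153.jpg'}

x = str(r)
print(x)  # {'urlReady': 'https://example.com/images/43153.jpg'}

# urllib looks for a scheme (http, https, ...) at the start of the string;
# here it finds the dict syntax instead, so it fails before any request is made
try:
    urlretrieve(x, '43153.jpg')
except URLError as e:
    print(e)  # an "unknown url type" error, like the one in the traceback above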

Adding a print statement after the DataFrame dump (print(dict_copy) right after dict_copy = df.to_dict('records')) and another at the start of your loop (print(r) right after for r in dict_copy:) would let you see what's going on and confirm or refute my hypothesis.
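For concreteness, a minimal sketch of those two checks (it assumes the same urlReady1.csv from your script is sitting next to it):

import pandas as pd

# assumes the same CSV file as in the question, with a urlReady column
df = pd.read_csv('urlReady1.csv')

dict_copy = df.to_dict('records')
print(dict_copy)   # the whole list of records, e.g. [{'urlReady': '...43153.jpg'}, ...]

for r in dict_copy:
    print(r)       # each r is a dict, not a URL string
    break          # one row is enough to confirm what the loop is seeing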

Thanks for adding sample data! So dict_copy is something like [{'urlReady': 'mobile.****.***.**/****/43153.jpg'}, {'urlReady': 'mobile.****.***.**/****/46137.jpg'}]

So yes, dict_copy is a list of dicts, each with 'urlReady' as the key and a URL string as the value. You want to retrieve the URL from each dict using that key. The best approach may depend on things like whether the data contains rows without valid URLs, but this can get you started and give you a little view of the data so you can see if anything is weird:

for r in dict_copy:
    urlstr = r.get('urlReady', '')  # .get with a default of '' means you can safely use string methods to validate the data
    print('\nurl check: type is', type(urlstr), 'url is', urlstr)
    if isinstance(urlstr, str) and '.jpg' in urlstr:  # make sure the url has a jpg; replace with `if True` or another check if that makes more sense
        filejpg = r"[\d]{1,}\.jpg"  # raw string so the \d escape survives intact
        urlShort = re.search(filejpg, urlstr).group()
        print('downloading from', urlstr, 'to', f'{di}/{urlShort}')
        download(urlstr, f'{di}/{urlShort}')
    else:
        print('bad data! dict:', r, 'urlstr:', urlstr)
  • First of all, I would like to thank you so much for your effort, dear Chris. I tried what you advised and here is the result of print(r): {'urlReady': 'https://mobile.*****.gov.**/****/43153.jpg'}, and print(dict_copy) returns [{'urlReady': 'https://mobile.****.***.**/****/43153.jpg'}, {'urlReady': 'https://mobile.****.***.**/****/46137.jpg'}] # I replaced the letters with * for privacy purposes. – dxbforce Nov 24 '20 at 17:28
  • ok, updated the answer with some sample code. Your sample data didn't contain the usual 'http(s)://' in front of the URLs, which will break the urllib download, but from your error message, maybe your real data does contain it. If you have problems, you may want to add a check for `urlstr.startswith('http')` and, if it fails, prepend `'http://'` (or https if appropriate) to the URL (sketched below, after these comments) – Chris Launey Nov 24 '20 at 19:52
  • [{'urlReady': 'https:// mobile.****.***.**/****/43153.jpg'}, {'urlReady': 'https:// mobile.****.***.**/****/46137.jpg'}] No, it does have https, but I don't know why it doesn't show in the comment; in the Jupyter notebook it is there and the URL is perfect – dxbforce Nov 24 '20 at 19:55
  • cool! ok, see how the code snippet I pasted works for you then. If there is ever a 404 or another error, this would break, so another improvement you could add to guard against that would be a try: / except: block around the download (also sketched below). (Also, I just fixed a typo where I had commented out the download func in the code snippet.) – Chris Launey Nov 24 '20 at 20:03
  • Dear Chris, it works like a Swiss army knife :) Thank you for all your effort and kindness. I really appreciate the time you put into solving this issue and I wish you all the best – dxbforce Nov 25 '20 at 09:56
  • No worries @dxbforce! Glad it helped! – Chris Launey Dec 02 '20 at 01:20
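
Putting the two suggestions from the comments together (the http prefix check and a try/except around the download), a hardened version of the loop might look roughly like this sketch. The file name urlReady1.csv, the urlReady column, and the Dubai folder are taken from the question; everything else is illustrative, not a tested drop-in replacement:

import re
from pathlib import Path
from urllib.request import urlretrieve as download

import pandas as pd

df = pd.read_csv('urlReady1.csv')           # same CSV as in the question
di = 'Dubai'
Path(di).mkdir(exist_ok=True)

for r in df.to_dict('records'):
    urlstr = r.get('urlReady', '')
    if not isinstance(urlstr, str) or '.jpg' not in urlstr:
        print('bad data! dict:', r)
        continue
    if not urlstr.startswith('http'):        # prepend a scheme if the CSV stores bare hostnames
        urlstr = 'https://' + urlstr
    match = re.search(r"[\d]{1,}\.jpg", urlstr)
    if match is None:                        # no numeric jpg name found in this row
        print('no jpg file name found in', urlstr)
        continue
    try:                                     # guard against 404s and other download failures
        download(urlstr, f'{di}/{match.group()}')
        print('downloaded', urlstr, 'to', f'{di}/{match.group()}')
    except OSError as e:                     # URLError/HTTPError are subclasses of OSError
        print('failed to download', urlstr, '->', e)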