Images scraped with Python requests are corrupt

Question

This refers to the same problem as the question "Scraped images is corrupt".

In my case, I am trying to scrape images from a site with

There are 100 images. The first 67 images saved fine as .jpg, but from the 68th onward all the images are corrupted; Windows says it doesn't support this file format.

Unlike in the other Stack Overflow question, I don't have a data-src attribute; the content is shown in the image above.

The request script is:

import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

# urls, df_full and download_image are defined elsewhere in the script (not shown here).
for url in urls:
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
        'Referer': url
    })
    soup = BeautifulSoup(response.text, "html.parser")
    # The last path segment of the page URL names the folder and prefixes each image ID.
    folder_name = url.rsplit('/', 1)[-1]
    imagePrefix = folder_name + "_"
    image_info = []
    for imgNo, item in enumerate(soup.find_all('img')):
        imageID = imagePrefix + str(imgNo)
        image_info.append((item["src"], imageID))
    df = pd.DataFrame(image_info, columns=['imageURL', 'imageID'])
    df['category'] = folder_name
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    df_full = pd.concat([df_full, df])
    parent_dir = "C:/data/images/"
    path = os.path.join(parent_dir, folder_name)
    # makedirs with exist_ok=True does not fail if the folder is already there.
    os.makedirs(path, exist_ok=True)
    for info in image_info:
        download_image(info, folder_name)

How can this issue be fixed?
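
Since download_image is not shown, one way to narrow this down might be to check what the server actually returned before writing it to disk. The following is only a minimal sketch under assumptions: save_if_image and its return convention are hypothetical names that are not part of the script above, and the response is assumed to be a requests.Response (or any object with status_code, headers and content).

```python
from pathlib import Path


def save_if_image(response, path):
    """Write response.content to `path` only if it looks like an image.

    Returns None on success, or a short string saying why the response
    was rejected. `response` is assumed to expose status_code, headers
    and content, as requests.Response does.
    """
    if response.status_code != 200:
        return f"HTTP {response.status_code}"
    content_type = response.headers.get("Content-Type", "")
    if not content_type.lower().startswith("image/"):
        return f"unexpected Content-Type: {content_type!r}"
    if not response.content:
        return "empty body"
    # write_bytes always writes raw bytes, so nothing is translated on Windows.
    Path(path).write_bytes(response.content)
    return None
```

Calling it with the response for each image URL and printing the returned message for image 68 onward would show whether those URLs return something other than image data, for example an HTML page or an error status.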

Is the file extension available? If not, try using png or jpeg. — Sidhar_t.py, Oct 18 '21 at 03:18
No, it's not available, but those 67 images saved fine as image0.jpg and so on. — hanzgs, Oct 18 '21 at 03:19
Have you printed the URLs to a separate file to see if you can fetch them with `wget` or `curl`? — Tim Roberts, Oct 18 '21 at 06:26
The =w1280 at the end of Google media URLs usually refers to a thumbnail of the original resource. According to the Google docs, the supported formats are BMP, GIF, JPEG, PNG, WebP, and SVG. Try any of these formats and see if it works. — Kris, Oct 18 '21 at 06:32
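
Following up on the format suggestion: instead of trying each extension by hand, the first bytes of a downloaded file can be compared against the well-known signatures of those formats. This is a rough sketch only; sniff_image_extension is a hypothetical helper, not something from the script in the question, and the SVG check is just a heuristic because SVG is plain XML text.

```python
def sniff_image_extension(data):
    """Guess a file extension from the leading bytes, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return ".gif"
    if data.startswith(b"BM"):
        return ".bmp"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    # Heuristic only: look for an <svg tag near the start of the data.
    if b"<svg" in data[:512].lower():
        return ".svg"
    return None
```

If this returns something other than ".jpg" for the images that fail to open, saving them with the detected extension may be enough. If it returns None, the downloaded bytes are probably not one of these image formats at all.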
