I am trying to download images from a web page, but the script finds nothing. It prints "Found 0 links" and then "Downloaded 0 files".

Here's the code:

import urllib.request
import re
import os

# the directory where the images are saved
DIRECTORY = "book"

#the url to fetch the html page where the images are
URL = "https://www.inaturalist.org/taxa/56061-Alliaria-petiolata/browse_photos"

#the regex to get the url to the images from the html page
REGEX = r'(?<=<a href=")http://\d.bp.inaturalist.org/[^"]+'



#the prefix of the image file name
PREFIX = 'page_'

if not os.path.isdir(DIRECTORY):
    os.mkdir(DIRECTORY)

contents = urllib.request.urlopen(URL).read().decode('utf-8')
links = re.findall(REGEX, contents)

print("Found {} links".format(len(links)))
print("Starting download...")

page_number = 1
total = len(links)
downloaded = 0
for link in links:
    filename = "{}/{}{}.jpg".format(DIRECTORY, PREFIX, page_number)
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(link, filename)
        downloaded = downloaded + 1
        print("done: {} ({}/{})".format(filename, downloaded, total))
    else:
        downloaded = downloaded + 1
        print("skip: {} ({}/{})".format(filename, downloaded, total))
    page_number = page_number + 1

print("Downloaded {} files".format(total))

How can I do it?

Sunderam Dubey
Murad Ali
  • What is your goal here? I cannot help without knowing it. What links are you finding, where, and how? – thatrandomperson Jun 27 '22 at 13:45
  • Did you check the source of the webpage you are scraping? First, the images are not actually links, but rather buttons. Second, all of the links are relative URLs (`/photos`, for example). – Esther Jun 27 '22 at 13:45
  • So there are no URLs that match your `REGEX`. This may be the issue there. – Lich Jun 27 '22 at 13:52
  • my goal is to download all the images in this link for some deep learning purpose. @thatrandomperson here is the link "https://www.inaturalist.org/taxa/56061-Alliaria-petiolata/browse_photos" – Murad Ali Jun 27 '22 at 13:55
  • @Esther yeah, these images look like buttons, as they have details inside about where they were taken, when they were uploaded, and by whom. Can you tell me how I can download them? I am not good at this and am really in trouble. Thanks – Murad Ali Jun 27 '22 at 13:56
  • @Lich yeah that's also the case but I can't correct it myself, can you help me in this? I think the problem is "REGEX = '(?<= – Murad Ali Jun 27 '22 at 13:58

1 Answer

I just fixed your regex and changed some logic. This script should work properly:

import urllib.request
import re
import os

# the directory where the images are saved
DIRECTORY = "book"

#the url to fetch the html page where the images are
URL = "https://www.inaturalist.org/taxa/56061-Alliaria-petiolata/browse_photos"

#the regex to get the url to the images from the html page
REGEX = re.compile(r'(?:(?:https?)+\:\/\/+[a-zA-Z0-9\/\._-]{1,})+(?:(?:jpe?g|png|gif))')


#the prefix of the image file name
PREFIX = 'page_'

if not os.path.isdir(DIRECTORY):
    os.mkdir(DIRECTORY)

contents = urllib.request.urlopen(URL).read().decode('utf-8')
links = re.findall(REGEX, contents)

print("Found {} links".format(len(links)))
print("Starting download...")

page_number = 1
total = len(links)
downloaded = 0
for link in links:
    ext = link.split('.')[-1]
    filename = "{}/{}{}.{}".format(DIRECTORY, PREFIX, page_number, ext)
    urllib.request.urlretrieve(link, filename)
    downloaded = downloaded + 1
    print("done: {} ({}/{})".format(filename, downloaded, total))
    page_number = page_number + 1

print("Downloaded {} files".format(total))

By the way, I'd suggest using a library or framework for this job (e.g. Scrapy or BeautifulSoup).
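
To see why the new regex finds links where the original one did not, here is a minimal sketch that runs the same pattern on a small, made-up page fragment. The `SAMPLE_HTML` string, the `static.example.org` URLs, and the `find_image_urls` helper are assumptions for illustration only and are not taken from the real page. The pattern matches image URLs anywhere in the text, including inside a script tag, while the original pattern only looked at `<a href="...">` attributes. The helper also drops duplicate URLs, which the script above does not do.

```python
import re

# Same pattern as in the script above.
IMAGE_URL = re.compile(r'(?:(?:https?)+\:\/\/+[a-zA-Z0-9\/\._-]{1,})+(?:(?:jpe?g|png|gif))')

# Hypothetical page fragment (assumption): the image URLs sit inside a
# <script> tag, and the only <a href="..."> is a relative link.
SAMPLE_HTML = '''
<script>
  var photos = [
    {"url": "https://static.example.org/photos/1/medium.jpg"},
    {"url": "https://static.example.org/photos/2/medium.jpeg"},
    {"url": "https://static.example.org/photos/1/medium.jpg"}
  ];
</script>
<a href="/photos/3">relative link, no image URL</a>
'''


def find_image_urls(html):
    """Return matching image URLs in order of first appearance, without duplicates."""
    return list(dict.fromkeys(IMAGE_URL.findall(html)))


urls = find_image_urls(SAMPLE_HTML)
print("Found {} links".format(len(urls)))
for url in urls:
    print(url)
```

If this prints the expected URLs but the real script still reports zero links, the downloaded HTML probably does not contain the image URLs at all, which would be worth checking by printing part of `contents`.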

Lich
  • Idk how BeautifulSoup will help because all the images in the site are in script tags which I don’t think BeautifulSoup can read. – thatrandomperson Jun 27 '22 at 14:20
  • @Lich thanks for fixing it. I don't know exactly what's wrong as it is still not downloading anything and it also doesn't show any error – Murad Ali Jun 27 '22 at 14:29
  • @MuradAli it works for me. It might be a permission error, or you're not looking in the right place. – thatrandomperson Jun 27 '22 at 14:33
  • @MuradAli please try to specify `DIRECTORY` as a full path (e.g. `/Users/username/book`) to make sure where your files should be located. – Lich Jun 27 '22 at 15:15
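
Following up on the last comment, a minimal sketch of how to check where a relative `DIRECTORY` actually points. The `resolve_output_dir` helper is a hypothetical name introduced here; it only uses `os.path.abspath` and `os.getcwd` from the standard library. A relative path such as `"book"` is resolved against the current working directory, which depends on where the script was started from, so the files may end up somewhere other than next to the script.

```python
import os

DIRECTORY = "book"  # same relative name as in the script


def resolve_output_dir(name):
    """Return the absolute path that a (possibly relative) directory name refers to."""
    return os.path.abspath(name)


target = resolve_output_dir(DIRECTORY)
print("Current working directory:", os.getcwd())
print("Images will be saved to:", target)
```

Printing this once before the download loop shows the exact folder to look in.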