
I am a beginner at scraping, and to save images from a page I referenced the code from this answer.

This is the code snippet I am using:

from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import sys
import time

def make_soup(url):
    req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"}) 
    html = urllib2.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print 'Getting: ' + filename
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except Exception as e:
            print '  An error occurred (%s). Continuing.' % e
    print 'Done.'

if __name__ == '__main__':
    #url = sys.argv[1]
    get_images('https://i1.adis.ws/i/jpl/sz_093868_a?qlt=80&w=600&h=672&v=1')

I am getting results from many sites, but the URL I am using in the code above is not working, and I need the code to work on that URL specifically.

Please help me with this, or tell me if there is a problem with the URL.

  • What do you mean by **not working**? What did you expect, and what is instead happening? – shad0w_wa1k3r Mar 08 '17 at 07:30
  • I used these lines b=a.findAll('img') to check the html parsed output before executing the get_images function and I have also tried using various parsers other than lxml. – Nitin Kumar Singh Mar 08 '17 at 07:31
  • I expected an image to be saved locally, but the HTML parsed output from BeautifulSoup is not correct – Nitin Kumar Singh Mar 08 '17 at 07:33
  • You aren't `return`ing anything from the `make_soup` function, how is it working for other websites? – shad0w_wa1k3r Mar 08 '17 at 07:36
  • I was returning it earlier but could not get anything, as it was empty – Nitin Kumar Singh Mar 08 '17 at 08:48
  • You need to `return a` else your `soup = make_soup(url)` will be `None` and your script would err. – shad0w_wa1k3r Mar 08 '17 at 08:50
  • so I printed it in the function just to check the parsed html and it is also wrong as you can see by going to the url that there is a image in the url but parsed html doesn't show any img tags – Nitin Kumar Singh Mar 08 '17 at 08:52
  • The link I mentioned in the question, from where I referenced the code – I used that same code in the example above – Nitin Kumar Singh Mar 08 '17 at 08:53
  • But are you getting the correct response in the first place? Is it the same that you are expecting? I didn't see any valid html / xml in my response. – shad0w_wa1k3r Mar 08 '17 at 08:56
  • I have updated the code you can see but it didn't return any valid response as rightly pointed out by you – Nitin Kumar Singh Mar 08 '17 at 08:59
  • So, this question is invalid in that case :) And from what I can see by opening the url in my browser, it's actually an image itself! So, you may want to filter out such urls from your list and save the response accordingly. – shad0w_wa1k3r Mar 08 '17 at 09:03
  • You can check for the headers of the response and parse accordingly `{'Content-Length': '28281', 'X-Amp-Published': 'Sat, 21 Jun 2014 18:53:54 GMT', 'Date': 'Wed, 08 Mar 2017 08:53:53 GMT', 'Accept-Ranges': 'bytes', 'Expires': 'Wed, 08 Mar 2017 09:23:53 GMT', 'Server': 'Unknown', 'X-Amp-Source-Width': '1785', 'Connection': 'keep-alive', 'Edge-Control': 'max-age=14400', 'Cache-Control': 's-maxage=14400, max-age=1800', 'X-Amp-Source-Height': '2000', 'Access-Control-Allow-Origin': '*', 'X-Req-ID': 'ITrIxNFmOt', 'Content-Type': 'image/jpeg'}` – shad0w_wa1k3r Mar 08 '17 at 09:04
  • Also, you should use the [`requests` library](http://docs.python-requests.org/en/master/). It is better and easier to use. – shad0w_wa1k3r Mar 08 '17 at 09:06
  • But if it's a proper HTML page with an img in a specific area, will I not be able to parse that tag from the page? – Nitin Kumar Singh Mar 08 '17 at 09:06
  • No, it isn't a proper html page. The browser opens the image as such. It is only an image. Like I said, try checking the `Content-Type` tag in the response headers. – shad0w_wa1k3r Mar 08 '17 at 09:08
  • Thanks for the answer, I appreciate your help – Nitin Kumar Singh Mar 08 '17 at 09:17
  • How do I close the question? – Nitin Kumar Singh Mar 08 '17 at 09:17
  • You don't have to, since the question is not off-topic IMO (which I previously thought and thus recommended to delete) – shad0w_wa1k3r Mar 08 '17 at 09:31

1 Answer


The link you have in your question is itself an image, not an HTML page.

>>> import requests
>>> r = requests.get('https://i1.adis.ws/i/jpl/sz_093868_a?qlt=80&w=600&h=672&v=1')
>>> r.headers
{'Content-Length': '28281', 'X-Amp-Published': 'Sat, 21 Jun 2014 18:53:54 GMT', 'Date': 'Wed, 08 Mar 2017 08:53:53 GMT', 'Accept-Ranges': 'bytes', 'Expires': 'Wed, 08 Mar 2017 09:23:53 GMT', 'Server': 'Unknown', 'X-Amp-Source-Width': '1785', 'Connection': 'keep-alive', 'Edge-Control': 'max-age=14400', 'Cache-Control': 's-maxage=14400, max-age=1800', 'X-Amp-Source-Height': '2000', 'Access-Control-Allow-Origin': '*', 'X-Req-ID': 'ITrIxNFmOt', 'Content-Type': 'image/jpeg'}
>>> r.headers['Content-Type']
'image/jpeg'

So, you may want to check the Content-Type first and then see if you want to go over the link (crawl more urls) and extract images from it.
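Building on that, here is a minimal sketch of how you might branch on the `Content-Type` header: save the response body directly when the URL points at an image, and only parse for `img` tags when it is HTML. The helper names (`is_image`, `fetch`) are illustrative, and this assumes the `requests` and `beautifulsoup4` packages are installed.

```python
import requests
from bs4 import BeautifulSoup

def is_image(content_type):
    # 'image/jpeg' -> True; 'text/html; charset=utf-8' -> False
    return content_type.split(';')[0].strip().startswith('image/')

def fetch(url):
    # Save the body directly if the URL is an image; otherwise
    # parse it as HTML and return the src of each <img> tag.
    r = requests.get(url, stream=True)
    ctype = r.headers.get('Content-Type', '')
    if is_image(ctype):
        # derive a filename from the URL, dropping the query string
        filename = url.split('/')[-1].split('?')[0] or 'image'
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(8192):
                f.write(chunk)
        return [url]
    soup = BeautifulSoup(r.text, 'html.parser')
    return [img.get('src') for img in soup.find_all('img')]
```

With this in place, passing the `i1.adis.ws` URL would hit the first branch and save the JPEG directly, while an ordinary page would fall through to the BeautifulSoup path.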

shad0w_wa1k3r