
I'm new to web scraping, so I'm not totally sure what to do here. I'm trying to extract the images from the page at the URL set in the code below.

Here are the loops that got the closest to working:

For loop with parsing function

import requests
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse

url = "https://www.legacysurvey.org/viewer/data-for-radec/?ra=55.0502&dec=-18.5790&layer=ls-dr8&ralo=55.0337&rahi=55.0655&declo=-18.5892&dechi=-18.5714"

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_images(url):
    """
    Returns all image URLs on a single `url`
    """
    soup = bs(requests.get(url).content, "html.parser")
    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src")
        if not img_url:
            # if img does not contain src attribute, just skip
            continue
        # make the URL absolute by joining it with the page URL
        img_url = urljoin(url, img_url)
        if is_valid(img_url):
            urls.append(img_url)
    return urls

# call the function and show what it found
print(get_all_images(url))

While loop - image scraping

import requests
from bs4 import BeautifulSoup

# link to first page - without `page=`
url = 'https://www.legacysurvey.org/viewer/data-for-radec/?ra=55.0502&dec=-18.5799&layer=ls-dr8&ralo=55.0337&rahi=55.0655&declo=-18.5892&dechi=-18.5714'

# only for information, not used in url
page = 0 

while True:

    print('---', page, '---')

    r = requests.get(url)

    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML: <img> tags carry their URL in `src`, not `href`
    for link in soup.find_all("img"):
        print("<img src='%s'>" % link.get("src"))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})

    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)

    # link to next page

    next_page = soup.find('a', {'class': 'next'})

    if next_page:
        # resolve a possibly relative href against the current page URL
        url = requests.compat.urljoin(url, next_page.get('href'))
        page += 1
    else:
        break # exit `while True`

I tried to adapt both of these to download the images they find, but I haven't been able to get any output from anything I've tried. Any help is greatly appreciated!
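
For the download step itself, here is a minimal sketch, assuming a list of absolute image URLs has already been collected. It uses only the standard library (`urllib.request`) rather than `requests`. The helper names `filename_from_url` and `download_all` and the `images` folder are made up for illustration.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url):
    """Return the last path component of `url`, ignoring any query string."""
    name = os.path.basename(urlparse(url).path)
    # Fall back to a placeholder name when the path ends in a slash.
    return name or "download"


def download_all(urls, folder="images"):
    """Save every URL in `urls` into `folder` and return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(folder, filename_from_url(url))
        # Stream the response body straight to disk.
        with urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved
```

Something like `download_all(get_all_images(url))` would then tie it to the first snippet, provided that function returns a non-empty list. Two URLs that end in the same file name would overwrite each other in this sketch.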

mkrieger1
Autumn
  • Please consider posting your code as text and not as an image. As it stands, no one can run your code without first extracting it from the image. – Tofu Feb 04 '21 at 19:33
  • Oh sorry, let me go back! – Autumn Feb 04 '21 at 23:50
  • That webpage does not contain any `<img>` tags; it contains a link to one JPG file, which could be obtained. What are you trying to get: the single JPG image, or the `.fits` images contained inside the `.gz` entries in the table? – Martin Evans Feb 05 '21 at 10:21
  • Oh, thank you! Essentially, I am looking for a big list of images of the galaxies in that cluster. I may need to use a different link source if there is only one image listed in the data. Thank you so much for your help! – Autumn Feb 05 '21 at 18:16
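
Following up on Martin Evans's comment, here is a rough sketch of collecting file links from `<a href=...>` tags instead of `<img>` tags. It uses the standard-library `html.parser` in place of BeautifulSoup so that it runs without extra installs. The `LinkCollector` class, the `find_file_links` helper, the suffix list, and the sample HTML are all assumptions for illustration, not the actual markup of the Legacy Survey page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_file_links(html, base_url, suffixes=(".jpg", ".fits.gz")):
    """Return absolute URLs of links whose path ends with one of `suffixes`."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL.
        absolute = urljoin(base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(suffixes):
            links.append(absolute)
    return links


# Hypothetical sample markup, not copied from the real page.
sample = """
<a href="/viewer/cutout.jpg?ra=55.05">JPEG cutout</a>
<table><tr><td><a href="https://example.org/data/image-g.fits.gz">g band</a></td></tr></table>
<a href="/about">About</a>
"""
print(find_file_links(sample, "https://www.legacysurvey.org/viewer/"))
```

With the real page, `html` would be the text of the `requests.get(url)` response and `base_url` the page URL. Whether the links there actually end in `.jpg` or `.fits.gz` would need checking in the browser's page source first.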
