I want to download the images referenced in the HTML below using bs4. My code works to extract the date and exam name from this markup:

<div class="item-name" data-toggle="collapse" data-target="#exam-4" aria-expanded=false>
  <div class="ui-h2">April 2022 <span class="ui-tag grey-transparent">14 Exams</span></div>
</div>
<div class="item-details collapse " id="exam-4" data-parent="#exam-month">
  <div class="row">
      <div class="col-12 col-lg-4">
          <div class="ui-card hover-scale">
              <a href="https://example.com/uppsc-acf-rfo" class="card-link exam-cards">
                  <div>
                      <span class="icon calendar-icon"></span>
                      <span class="help__content help__content--small">3 Apr 2022</span>
                      <span class="ui-tag green-filled">Official</span>
                  </div>
                  <div class="footer-container">
                      <span class="exam-icon">
                      <img src="https://blogmedia.com/blog/wp-content/uploads/2020/06/uttar-pradesh-logo-png-8-5bbbec3b.png" height="30">
                      </span>
                      <span class="exam-name" title="UPPSC ACF RFO Mains">UPPSC ACF RFO Mains</span>
                      <span class="exam-cta">
                      Know More <span class="right-icon"></span>
                      </span>
                  </div>
              </a>
          </div>
      </div>

I am using the following code:

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('div', {'class':'row'})

rowList = []
for row in rows:
    cards = row.find_all('div', {'class':re.compile("^ui-card hover-scale")})
    for card in cards:
        dateStr = card.find('span',{'class':re.compile("^help__content")}).text.strip()
        examName = card.find('span', {'class':'exam-name'}).text
        rowList.append({'date':dateStr,
                        'exam':examName})

df = pd.DataFrame(rowList)
df.to_csv('filename.csv', index=False)

Current Output:

0  3 Apr 2022  UPPSC ACF RFO Mains

Expected Output:

0  3 Apr 2022  UPPSC ACF RFO Mains    uttar-pradesh-logo-png-8-5bbbec3b.png

The .png file should also be saved to a separate directory. PS: I am only including part of the HTML; there are multiple cards.

  • Try looking at this https://stackoverflow.com/questions/37158246/how-to-download-images-from-beautifulsoup – Dutch Feb 17 '22 at 13:46

1 Answer

import re
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('div', {'class':'row'})

rowList = []
for row in rows:
    cards = row.find_all('div', {'class':re.compile("^ui-card")})
    for card in cards:
        try:
            dateStr = card.find('span',{'class':re.compile("^help__content")}).text.strip()
        except Exception as e:
            print(e)
            dateStr = 'N/A'
        
        try:
            examName = card.find('span', {'class':'exam-name'}).text
        except Exception as e:
            print(e)
            examName = 'N/A'
        
        try:
            imgUrl = card.find('img')['src']
            imgFile = imgUrl.split('/')[-1]
            
            # To Write to file
            urllib.request.urlretrieve(imgUrl, imgFile)
        except Exception as e:
            print(e)
            imgFile = 'N/A'        
        
        rowList.append({'date':dateStr,
                        'exam':examName,
                        'img':imgFile})
        


df = pd.DataFrame(rowList)
df.to_csv('filename.csv', index=False)
chitown88
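
The question also asks for the images to be stored in a separate directory, whereas the code above saves each file next to the script. A minimal sketch of one way to handle that with the standard library only; the helper name `local_image_path` and the directory name `images` are illustrative assumptions, not part of the original answer:

```python
import os
from urllib.parse import urlparse


def local_image_path(img_url, target_dir="images"):
    """Return the path inside target_dir where img_url should be saved."""
    # Create the directory on first use; exist_ok avoids an error if it exists.
    os.makedirs(target_dir, exist_ok=True)
    # Take the file name from the URL path only, ignoring any ?query part.
    file_name = os.path.basename(urlparse(img_url).path)
    return os.path.join(target_dir, file_name)
```

With this, the download line in the loop could become `urllib.request.urlretrieve(imgUrl, local_image_path(imgUrl))`, while `imgFile` stays the bare file name written to the CSV.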
  • imgUrl = card.find('img')['src'] TypeError: 'NoneType' object is not subscriptable. Shall I use another parameter to identify the image? – stackoverflow rohit Feb 17 '22 at 13:59
  • nope. Images will always be in the `<img>` tag with a `src` attribute. Just like your previous question, what is happening is there is no `<img>` tag under that particular card. So you can either deal with it using the try/except, or by first checking if there is an `<img>` tag present, and if not, skip over it in the iteration. – chitown88 Feb 17 '22 at 14:02
  • I added a try/except in there. – chitown88 Feb 17 '22 at 14:05
  • This should work as expected, but when I download the image from the URL it gives a 403 Forbidden error – stackoverflow rohit Feb 17 '22 at 15:00
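
A 403 Forbidden from `urlretrieve` is often, though not always, caused by the server rejecting the default `Python-urllib` User-Agent. A minimal sketch of a download that sends a browser-like header instead; the `build_request` and `download_image` helpers and the header value are assumptions for illustration, and the site may still refuse the request for other reasons (referer checks, cookies, rate limiting):

```python
import shutil
import urllib.request

# Assumed browser-like header; the exact string is illustrative only.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_request(img_url):
    """Wrap the URL in a Request carrying the custom headers."""
    return urllib.request.Request(img_url, headers=HEADERS)


def download_image(img_url, dest_path):
    """Stream the image at img_url into the file at dest_path."""
    with urllib.request.urlopen(build_request(img_url)) as resp, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

In the loop, `download_image(imgUrl, imgFile)` would take the place of the `urllib.request.urlretrieve(imgUrl, imgFile)` call; the surrounding try/except still catches any remaining HTTP errors.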