Web scraping for PNG images

Question

I am working on scraping company logos from their websites. I have 12 million company records with their domain names.

I first try to scrape the logo from the company's website; if access to the website is forbidden, I fall back to the company's Wikipedia page.

This is my code; I have worked on the domain names and the Wikipedia pages separately.

from urllib.request import urlopen
from bs4 import BeautifulSoup
  
htmldata = urlopen('https://en.wikipedia.org/wiki/Pepsi')
soup = BeautifulSoup(htmldata, 'html.parser')
images = soup.find_all('img')
  
for item in images:
    print(item['src'])

The above code fetches data for just one company and prints all the image sources from its Wikipedia page. However, I need to fetch only the logos from the Wikipedia page and scale this to many companies.

Output from the above code looks like this:

//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/6/66/Pepsi_355ml.png/150px-Pepsi_355ml.png
//upload.wikimedia.org/wikipedia/commons/thumb/2/21/HMB_Bern_New_Bern_Caleb_Bradham.jpg/220px-HMB_Bern_New_Bern_Caleb_Bradham.jpg

However, I need to fetch only the image sources that contain company logos. Expected output:

upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png

Please help me store the link in a dataframe along with the domain name.

Md. Fazlul Hoque · Accepted Answer · 2022-10-31T07:04:09.883

You can try something as follows:

from urllib.request import urlopen
from bs4 import BeautifulSoup
  
htmldata = urlopen('https://en.wikipedia.org/wiki/Pepsi')
soup = BeautifulSoup(htmldata, 'html.parser')
images = soup.find_all('img')
  
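for item in images:
    src = item.get('src')
    if not src:
        continue  # skip <img> tags that have no src attribute
    # Wikipedia serves protocol-relative URLs (//upload.wikimedia.org/...)
    img = 'https:' + src
    # Keep only images whose URL mentions "logo"
    if 'logo' in img:
        print(img)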

Output:

https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_Cola_logo_1902.svg/90px-Pepsi_Cola_logo_1902.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Pepsi_Cola_logo_1940.svg/90px-Pepsi_Cola_logo_1940.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Pepsi_logo.svg/220px-Pepsi_logo.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e9/Pepsi_logo_2008.svg/220px-Pepsi_logo_2008.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/30px-Commons-logo.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a6/PepsiCo_logo.svg/130px-PepsiCo_logo.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png
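The output above still includes Wikipedia's own interface icons (Commons-logo.svg), and the question also asks for one link per domain. A minimal sketch of how that could be handled, using only the standard library: the helper names logo_urls and build_rows, the pepsi.com domain key, and the rule "take the first match" are all assumptions made for illustration, not part of the original answer. The 'logo' substring test is the same heuristic as above, so it will miss logos whose file names do not contain that word.

```python
from posixpath import basename
from urllib.parse import urljoin, urlparse

# Assumed base URL, used only to resolve protocol-relative src values.
WIKI_BASE = "https://en.wikipedia.org/"


def logo_urls(srcs, base=WIKI_BASE):
    """Return absolute URLs from srcs whose file name contains 'logo'."""
    found = []
    for src in srcs:
        if not src:
            continue  # ignore missing or empty src values
        url = urljoin(base, src)  # //host/path becomes https://host/path
        name = basename(urlparse(url).path).lower()
        # Skip Wikipedia interface icons such as Commons-logo.svg
        if "logo" in name and "commons-logo" not in name:
            found.append(url)
    return found


def build_rows(pages):
    """pages maps a domain name to the list of <img> src values found for it."""
    rows = []
    for domain, srcs in pages.items():
        urls = logo_urls(srcs)
        # Assumption: the first match is the current logo; None if nothing matched.
        rows.append({"domain": domain, "logo_url": urls[0] if urls else None})
    return rows


# The four src values printed in the question; "pepsi.com" is a hypothetical key.
srcs = [
    "//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png",
    "//upload.wikimedia.org/wikipedia/en/thumb/6/66/Pepsi_355ml.png/150px-Pepsi_355ml.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/2/21/HMB_Bern_New_Bern_Caleb_Bradham.jpg/220px-HMB_Bern_New_Bern_Caleb_Bradham.jpg",
]
rows = build_rows({"pepsi.com": srcs})
print(rows)
```

In the scraping loop, the src list for each page could come from [img.get('src') for img in soup.find_all('img')], and the resulting rows can be passed to pandas.DataFrame(rows) to get a dataframe with domain and logo_url columns.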
