I am working on scraping company logos from their websites. I have 12 million company records with their domain names.
I try to scrape the logo from the company's website first; if access to the website is forbidden, I fall back to the company's Wikipedia page.
This is the code I have so far. I have worked on the domain names and the Wikipedia pages separately.
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Download and parse the Wikipedia page for a single company
htmldata = urlopen('https://en.wikipedia.org/wiki/Pepsi')
soup = BeautifulSoup(htmldata, 'html.parser')
# Print the source URL of every image on the page
images = soup.find_all('img')
for item in images:
    print(item['src'])
The code above fetches the page for only one company and prints every image source on its Wikipedia page. However, I need to fetch only the logo from each page and scale this to many companies.
Output from the above code looks like this:
//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/6/66/Pepsi_355ml.png/150px-Pepsi_355ml.png
//upload.wikimedia.org/wikipedia/commons/thumb/2/21/HMB_Bern_New_Bern_Caleb_Bradham.jpg/220px-HMB_Bern_New_Bern_Caleb_Bradham.jpg
However, I need to fetch only the image sources that are company logos. Expected output:
upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
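To make the expected result concrete, here is a rough sketch of the kind of filter I have in mind. It assumes the logo's file name contains the word "logo", which holds for the Pepsi page but is only a heuristic and will not hold for every company. The helper name `pick_logo_sources` is hypothetical, and the sample list is the output shown above.

```python
from urllib.parse import unquote


def pick_logo_sources(sources):
    """Return the image sources whose file name contains 'logo' (heuristic)."""
    logos = []
    for src in sources:
        # The file name is the last path segment; decode %-escapes before matching.
        filename = unquote(src.rsplit('/', 1)[-1]).lower()
        if 'logo' in filename:
            # Drop the protocol-relative '//' prefix to match the expected output.
            logos.append(src.lstrip('/'))
    return logos


# Image sources printed for the Pepsi page
sources = [
    '//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png',
    '//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png',
    '//upload.wikimedia.org/wikipedia/en/thumb/6/66/Pepsi_355ml.png/150px-Pepsi_355ml.png',
    '//upload.wikimedia.org/wikipedia/commons/thumb/2/21/HMB_Bern_New_Bern_Caleb_Bradham.jpg/220px-HMB_Bern_New_Bern_Caleb_Bradham.jpg',
]

print(pick_logo_sources(sources))
```

A page may have no match or several matches under this rule, so the result is a list rather than a single link.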
Please help me store the link in a DataFrame along with the domain name.
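The result I am after would look roughly like the sketch below, assuming pandas and a mapping from domain name to logo link collected beforehand. The function name `build_logo_frame`, the column names, and the domains `pepsi.com` and `example.com` are illustrative assumptions, not part of my data; `None` marks a company for which no logo was found.

```python
import pandas as pd


def build_logo_frame(logo_by_domain):
    """Build a two-column DataFrame from a {domain: logo_url} mapping."""
    rows = [
        {'domain': domain, 'logo_url': logo_url}
        for domain, logo_url in logo_by_domain.items()
    ]
    # Fix the column order explicitly so an empty mapping still yields both columns.
    return pd.DataFrame(rows, columns=['domain', 'logo_url'])


# Illustrative input: one domain with a logo link, one without
logo_by_domain = {
    'pepsi.com': 'upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png',
    'example.com': None,
}

df = build_logo_frame(logo_by_domain)
print(df)
```

With 12 million records, the rows would probably have to be collected and saved in batches rather than held in one mapping.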