I am new to programming and am trying to build my first little web crawler in python.
Goal: Crawling a product list page - scraping brand name, article name, original price and new price - saving in CSV file
Status: I've managed to get the brand name, article name as well as original price and put them into correct order into a list (e.g. 10 products). As there is a brand name, description and price for all items, my code get them in correct order into the csv.
Code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = 'https://www.zalando.de/rucksaecke-herren/'
#open connection, grabbing page, saving in page_html and closing connection
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()
#Datatype, html paser
page_soup = soup(page_html, "html.parser")
#grabbing information
brand_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn"})
articale_Names = page_soup.findAll ("div",{"class": "z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn"})
original_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_originalPrice-2Oy4G"})
new_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_promotionalPrice-3GRE7"})
#opening a csv file and printing its header
filename = "XXX.csv"
file = open(filename, "w")
headers = "BRAND, ARTICALE NAME, OLD PRICE, NEW PRICE\n"
file.write(headers)
#How many brands on page?
products_on_page = len(brand_Names)
#Looping through all brands, atricles, prices and writing the text into the CSV
for i in range(products_on_page):
brand = brand_Names[i].text
articale_Name = articale_Names[i].text
price = original_Prices[i].text
new_Price = new_Prices[i].text
file.write(brand + "," + articale_Name + "," + price.replace(",",".") + new_Price.replace(",",".") +"\n")
#closing CSV
file.close()
Problem: I am struggling with getting the discounted prices into my csv at the right place. Not every item has a discount and I currently see two issues with my code:
I use .findAll to look for the information on the website - as there are less discounted products then total products, my new_Prices contains fewer prices (e.g. 3 prices for 10 products). If i would be able to add them to the list, I assume they would show up in the first 3 rows. How can i make sure to add the new_Prices to the right prodcuts?
I am getting "Index Error: list index out of range" Error, which i assume is caused by the fact that i am looping through 10 products, however for new_Prices i am reaching the end quicker then for my other lists? Does that make sense and is that my assumption correct?
I am very much appreciating any help.
Thank,
Thorsten