
I am new to programming and am trying to build my first little web crawler in Python.

Goal: Crawl a product list page, scrape the brand name, article name, original price and new price, and save the results in a CSV file.

Status: I've managed to get the brand name, article name and original price and put them in the correct order into lists (e.g. 10 products). As there is a brand name, description and original price for every item, my code gets them into the CSV in the correct order.

Code:

    import bs4 
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    myUrl = 'https://www.zalando.de/rucksaecke-herren/'

    #open connection, grabbing page, saving in page_html and closing connection 
    uClient = uReq(myUrl)
    page_html = uClient.read()
    uClient.close()

    #Datatype, html parser
    page_soup = soup(page_html, "html.parser")

    #grabbing information
    brand_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn"})
    article_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn"})
    original_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_originalPrice-2Oy4G"})
    new_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_promotionalPrice-3GRE7"})

    #opening a csv file and printing its header
    filename = "XXX.csv"
    file = open(filename, "w")
    headers = "BRAND, ARTICALE NAME, OLD PRICE, NEW PRICE\n"
    file.write(headers)

    #How many brands on page?
    products_on_page = len(brand_Names)

    #Looping through all brands, articles, prices and writing the text into the CSV
    for i in range(products_on_page): 
        brand = brand_Names[i].text
        article_Name = article_Names[i].text
        price = original_Prices[i].text
        new_Price = new_Prices[i].text
        file.write(brand + "," + article_Name + "," + price.replace(",",".") + "," + new_Price.replace(",",".") + "\n")

    #closing CSV
    file.close()

Problem: I am struggling to get the discounted prices into my CSV in the right place. Not every item has a discount, and I currently see two issues with my code:

  1. I use .findAll to look for the information on the website - as there are fewer discounted products than total products, my new_Prices list contains fewer prices (e.g. 3 prices for 10 products). If I were able to add them to the list, I assume they would show up in the first 3 rows. How can I make sure to add the new_Prices to the right products?

  2. I am getting an "IndexError: list index out of range", which I assume is caused by the fact that I am looping through 10 products, but for new_Prices I reach the end sooner than for my other lists. Does that make sense, and is my assumption correct? (Small illustration below.)
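
To illustrate what I think is happening in issue 2 (made-up values, not data scraped from the site):

    # the scraped lists end up with different lengths
    original_Prices = ["10,00 €", "12,00 €", "15,00 €"]   # one entry per product
    new_Prices = ["8,00 €"]                               # only the discounted products

    for i in range(len(original_Prices)):
        print(original_Prices[i])
        print(new_Prices[i])   # IndexError: list index out of range once i >= 1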

I very much appreciate any help.

Thanks,

Thorsten

  • Please do not post screenshots of your code, copy relevant code into code blocks. – bgse Nov 05 '17 at 17:58
  • Post an input example too – Guilherme Nov 05 '17 at 18:03
  • @bgse updated with code into blocks – Thorstein Torento Nov 05 '17 at 18:03
  • @Guilherme not sure if I understand, could you please elaborate? What do you mean by an input example? – Thorstein Torento Nov 05 '17 at 18:04
  • @ThorsteinTorento I believe that Guilherme is asking you to post a link to the site in question. It would help us to understand what isn't working in your code – emporerblk Nov 05 '17 at 18:15
  • I'm asking for an example of the product list that you get, to see the problem and think about how to work around it – Guilherme Nov 05 '17 at 19:21
  • @emporerblk Thanks for clarifying. I will update the code in 1 sec – Thorstein Torento Nov 05 '17 at 22:15
  • @Guilherme: Thank you for clarifying. I've updated the code with "myUrl = 'https://www.zalando.de/rucksaecke-herren/'" Thanks for your help guys, hope you can help me out – Thorstein Torento Nov 05 '17 at 22:16
  • Only 6 of the 24 items have a `z-nvg-cognac_infoContainer-MvytX` class. You could select items by `'.z-nvg-cognac_infoContainer-MvytX'`, then find brand, article, price, new price ( or None) – t.m.adam Nov 06 '17 at 18:37
  • Hi @t.m.adam, thank you for the hint! I was thinking about this but was struggling with navigating to the right div for the price. So far I've been navigating "container.div.div", for example, but this always takes me deeper down into the first div. I would need to jump into the second, so something like container.div[x].div, but I am not sure about the syntax. Could you help out? – Thorstein Torento Nov 09 '17 at 18:33
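
For reference, here is a minimal sketch of the kind of navigation asked about in the last comment; the markup below the container is assumed rather than verified, so the index may need adjusting:

    # indexing into a container's direct child <div>s instead of chaining "container.div.div"
    containers = page_soup.find_all("div", {"class": "z-nvg-cognac_infoContainer-MvytX"})
    for container in containers:
        child_divs = container.find_all("div", recursive=False)   # direct children only
        if len(child_divs) > 1:
            print(child_divs[1].text)   # the second child div, i.e. the hypothetical "container.div[1]"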

1 Answer


Since some items don't have a `div.z-nvg-cognac_promotionalPrice-3GRE7` tag, you can't rely on the list index.
However, you can select all the container tags (`div.z-nvg-cognac_infoContainer-MvytX`) and use `find` to select tags within each item.

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import csv

my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
client = urlopen(my_url)
page_html = client.read().decode(errors='ignore')
page_soup = soup(page_html, "html.parser")

headers = ["BRAND", "ARTICALE NAME", "OLD PRICE", "NEW PRICE"]
filename = "test.csv"
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)

    items = page_soup.find_all(class_='z-nvg-cognac_infoContainer-MvytX')
    for item in items:
        brand_name = item.find(class_="z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn").text
        article_name = item.find(class_="z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn").text
        original_price = item.find(class_="z-nvg-cognac_originalPrice-2Oy4G").text
        new_price = item.find(class_="z-nvg-cognac_promotionalPrice-3GRE7")
        if new_price is not None:
            new_price = new_price.text   # items without a discount keep None, written as an empty field
        writer.writerow([brand_name, article_name, original_price, new_price])

If you want to get more than 24 items per page, you have to use a client that runs JavaScript, such as `selenium`.

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv

my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
driver = webdriver.Firefox()
driver.get(my_url)
page_html = driver.page_source
driver.quit()
page_soup = soup(page_html, "html.parser")
...

Footnotes:
The naming convention for functions and variables is lowercase with underscores.
When reading or writing csv files it's best to use the `csv` lib.
When handling files you can use the `with` statement.

t.m.adam
  • Hi @t.m.adam, your feedback and suggestions are much appreciated! I've finally gotten there myself, but your code looks much neater! One thing I noticed is that the page must have changed, as there are now more than 24 items on the page. Oddly, when running the crawler it only picks up 24 items. Any idea why? – Thorstein Torento Nov 20 '17 at 07:59
  • Yes, the rest of the items are loaded by js. You can test this if you disable js in your browser and visit the page. You can get all the items with `selenium` or sometimes via an ajax api. I will post an example when I have some free time. – t.m.adam Nov 20 '17 at 09:28
  • Hi @t.m.adam, great! Thanks! Out of interest, why would the page be set up in that way (loading those 24 items and then the rest via JS)? Thanks, T – Thorstein Torento Nov 20 '17 at 13:17
  • I have no idea; it's the first time I've seen something like this. It should be either static html or js for all items in the same group. Anyway, did you manage to get results with selenium? – t.m.adam Nov 20 '17 at 13:32
  • Hi @t.m.adam, interesting! :) I will give it a try once I am off work - just had a look at it in my lunch break! I will keep you posted – Thorstein Torento Nov 20 '17 at 13:49