
I'm trying to parse an Amazon product page. Half of the time I run the code it works and returns the information; the other half my request gets redirected to an Amazon page that seems designed to combat automated requests. When I try to print the URL of that page, it returns my original input URL, not the URL of the Amazon page. From what I've read, setting headers should solve this problem, but again, it only works for about half of the requests, which is quite strange. Is there any way to ensure I always get a real response?

Below is the code:

    import requests
    from bs4 import BeautifulSoup as soup

    #constants
    url = "https://www.amazon.com/Zephyrus-GeForce-i7-9750H-Windows-GX531GW-AB76/dp/B07QN3683G/ref=sr_1_12?dchild=1&keywords=zephyrus+g15&qid=1586732721&sr=8-12"

    #Amazon data class
    class items:

        def __init__(self, url):
            self.url = url

        #parses the page and returns (price, name)
        def data(self):
            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"}
            #default to None so the return below never hits an unbound name
            price = None
            name = None
            #get html response
            try:
                html = requests.get(self.url, headers=headers).content
            except Exception:
                print("Could not retrieve page")
            else:
                #parse the page
                pagesoup = soup(html, "html5lib")
                #get price, name
                try:
                    price = pagesoup.find("span", id="priceblock_ourprice").get_text().strip()
                except Exception:
                    print("Price could not be extracted")
                try:
                    name = pagesoup.find("span", id="productTitle").get_text().strip()
                except Exception:
                    print("Product name could not be extracted")
            return price, name

    #test
    item_1 = items(url)
    print(item_1.data())
  • _From what i've read using headers should solve this problem, but again, it only does for about half of the requests, which is really quite strange._ Changing the user agent is unlikely to be enough. _Is there anyway to ensure i always get a real response?_ If someone has a consistent way of defeating the anti-scraping/bot measures, I would be quite surprised if they shared it openly. _As an aside, why use a class here? Also, using `except Exception` like this is a bad idea, see https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. – AMC Apr 13 '20 at 00:26
  • I'll be sure to find all of the correct errors to replace the Exception with. The reason i'm using a class is because I want to display data from multiple products on a spreadsheet, although at the moment it only works with one. I guess i'll need to get a little more creative if I want to pass the anti scraping measures. – VICTORIX Apr 13 '20 at 00:41
  • _The reason i’m using a class is because I want to display data from multiple products on a spreadsheet_ I’m not sure I follow, sorry. As it stands the class could easily be transformed into a single function. – AMC Apr 13 '20 at 03:49

1 Answer


In my experience with crawling, it's usually best to assume that you cannot guarantee anything regarding the behavior of the side you don't control (in this case, the Amazon server).

What I would instead recommend, which appears to already be the direction you're going in, is to design your code so that it reacts appropriately when the desired behavior isn't what happens. For example, if this specific failure case at least has a consistent, recognizable form, you can have the code wait a specified period of time and try again (if it's working half the time, it's likely a timing issue). Crawlers and the like are going to fail sometimes; that's fine, as long as they're expecting the failures.
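A minimal sketch of that wait-and-retry idea might look like the following. The marker string used to recognize a real product page is an assumption; inspect the interstitial page Amazon actually returns and adjust the check accordingly (`fetch_with_retry` and its parameters are illustrative names, not an existing API):

    import time
    import requests

    def fetch_with_retry(url, headers=None, retries=3, backoff=5.0):
        for attempt in range(retries):
            resp = requests.get(url, headers=headers)
            #heuristic: real product pages contain the title span,
            #the anti-bot interstitial does not
            if resp.ok and 'id="productTitle"' in resp.text:
                return resp
            #wait longer after each failed attempt (linear backoff)
            time.sleep(backoff * (attempt + 1))
        return None  #caller must handle a persistent failure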

That being said, if you want to do a more thorough job of fooling the server, you can use something like Wireshark or tshark to capture the actual headers your regular web browser sends, so that you can match all of the headers (as opposed to just the user agent).
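As an illustration, a captured header set might be replayed like this. The values below are modeled on what a desktop Chrome browser typically sends and are only placeholders; capture your own browser's headers and copy those in instead. A `requests.Session` also keeps cookies between requests, which more closely matches real browser behavior:

    import requests

    #illustrative values; replace with headers captured from your browser
    browser_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

    #a Session sends these headers on every request and persists cookies
    session = requests.Session()
    session.headers.update(browser_headers)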

mwarrior
  • This is a comment rather than an answer. – nicomp Apr 13 '20 at 00:29
  • @nicomp see the final paragraph for the "answer". I provided the first two paragraphs because, in my experience, they're the most appropriate response to the question given. – mwarrior Apr 13 '20 at 00:32
  • I see that this sort of problem is common, i guess I'll have to get more creative. I'll try grabbing the actual headers and see if its any good at fooling the anti scraping measures. thanks. – VICTORIX Apr 13 '20 at 00:46