
I would like to use Python requests to monitor a particular URL and wait until it internally redirects me. The website redirects me at a random point after some period of time. However, I am having some issues right now. The strategy I have employed so far is something like this:

import time
import requests

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
} 

session = requests.Session()
success = False

while not success:
    r = session.get(url, headers=headers, allow_redirects=True)
    if keyword in r.text:
        success = True
    time.sleep(30)

print("Success.")

It seems as though every time I make a GET request, the timer is reset, so I am never redirected. I thought a session would fix this, but apparently not. Then again, how am I meant to check for changes to the page without sending a new request every 30 seconds? Looking at the network tab in Chrome, the redirect's status code is 307.
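For completeness, I know I could stop following the redirect and check the status code directly instead of searching the body. Something like this sketch (the `wait_for_redirect` wrapper is just for illustration):

```python
import time

def wait_for_redirect(session, url, headers=None, delay=30):
    """Poll `url` until the server answers with a redirect status.

    With allow_redirects=False, requests hands back the 3xx response
    itself instead of silently following it, so the status code can
    be inspected directly.
    """
    while True:
        r = session.get(url, headers=headers, allow_redirects=False)
        if r.status_code in (301, 302, 307, 308):
            return True
        time.sleep(delay)
```

But this still sends a fresh request every 30 seconds, which is exactly what I suspect is resetting the timer.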

If anyone knows how to resolve this issue it would be very helpful, thanks.

J P
  • for starters, if you remove `allow_redirects` you can much more simply check for `r.status_code in (301, 302)`. Outside of that is the issue that a cookie timeout is pushed out? If so you will need to block cookies ... more here https://stackoverflow.com/questions/17037668/how-to-disable-cookie-handling-with-the-python-requests-library – Matthew Story Jun 30 '18 at 01:04
  • @MatthewStory Well I think it's an internal redirect, so 307, but does that really make a difference? Can I check for an internal redirect without making a new GET? I think the issue is that every time I make a new GET request, the website gives me a new set of cookies and so the timer is reset. Any ideas how to fix that? – J P Jun 30 '18 at 07:15

1 Answer


Selenium is the quick and ugly answer:

import time

from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36")

browser = webdriver.Firefox(profile)
browser.get(url)

success = False
while not success:
    text = browser.page_source
    if keyword in text:
        success = True
    time.sleep(30)

print("Success.")

As far as using requests goes, I'd hazard a guess that your web browser is requesting the reload. Does the request in the network tab differ in any way from the initial request? browsermob-proxy is a great tool for deep-diving into these sorts of issues; it's effectively the network tab on steroids.

Apologies for the vagueness of the last half, but it's difficult to say more without having seen the website.
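One concrete thing to try on the cookie front: block cookie storage entirely, so that a fresh Set-Cookie on each poll can't restart any timer the server keys to your cookies. A sketch using the `http.cookiejar` policy approach from the question linked in the comments (`BlockAllCookies` is just my name for it):

```python
from http import cookiejar

import requests

class BlockAllCookies(cookiejar.CookiePolicy):
    """A cookie policy that rejects every cookie the server sends."""
    return_ok = set_ok = domain_return_ok = path_return_ok = \
        lambda self, *args, **kwargs: False
    netscape = True
    rfc2965 = hide_cookie2 = False

session = requests.Session()
session.cookies.set_policy(BlockAllCookies())
# Set-Cookie headers in responses are now ignored, so every request
# in the polling loop looks like a brand-new visitor.
```

If the redirect then never happens at all, that's evidence the server does want a persistent cookie; if it does happen, the cookie refresh was likely your problem.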

  • I should've mentioned that I've simplified my code as I am actually using aiohttp to send a decent amount of requests. I've tried selenium grid but it's too heavyweight. Apart from different headers the requests are the same. – J P Jun 30 '18 at 00:09
  • I see, yeah, selenium is a poor choice then. Well, assuming that the entire process isn't handled in-browser via JS, have you tried using the (or some of the) original cookies in the second request, along with the changed headers? Otherwise every time you re-query, it's possible that they send a new timing/user cookie, so all your requests appear to be from a new user/connection, regardless of how long you have been looping. – Sam Kowalski Jun 30 '18 at 00:24
  • I thought that the idea of a session was to have persistent cookies? – J P Jun 30 '18 at 00:26
  • it is, but if the server responds with 'Set-Cookie' in the headers, that cookie would overwrite a stored cookie of the same name, allowing for the possibility that all your cookies are 'fresh', i.e. have a recent timestamp. I don't know that that's actually the problem, as sending a new cookie would suggest the request it's replying to was viewed as a new instance/session. If you have changed your delay to an arbitrarily long time and have still not seen successes, then this is unlikely to be the issue. – Sam Kowalski Jun 30 '18 at 00:33
  • I see. Why would making the delay very long change anything? Do you know a real solution if that is the problem? Or perhaps something else I can try if it is not? Thanks – J P Jun 30 '18 at 00:40
  • A long delay just ensures that no timers could have been reset (as you mentioned in your questions), it would just be for the sake of ruling potential problems out. I would recommend setting up browser mob proxy and running selenium through that so that you can record all requests between your browser and the website. This isn't a solution, its just a means of digging. It would also let you take recorded requests and turn them into prepped requests so that you can attempt to reproduce them (http://docs.python-requests.org/en/master/user/advanced/). – Sam Kowalski Jun 30 '18 at 00:54
  • If you can reproduce secondary or tertiary or n-iary requests, then that would give you a great starting point. I would recommend focusing on the differences in the headers and attempting to reproduce the reloaded website so that you can reverse engineer the changes made to the cookies or headers. Although I'm sure you are already doing that, so I do apologize for giving a non-solution. You can always dig through their JS as well, assuming it's not obfuscated. – Sam Kowalski Jun 30 '18 at 00:57