
Let's say I have a website that I want to scrape, e.g. cheapoair.com.

I want to use a normal requests call in Python to scrape the data on the first, hypothetical page. If I end up being blocked by the server, I want to switch to a proxy. I have a list of proxy servers with a method that picks one, and I also have a list of user agent strings. However, I need help thinking through the problem.

For reference, uagen() returns a user agent string and proxit() returns a proxy.

Here is what I have so far:

import requests
from proxy_def import *
from http import cookiejar
import time
from socket import error as SocketError
import sys

start_time = time.time()


class BlockAll(cookiejar.CookiePolicy):
    return_ok = set_ok = domain_return_ok = path_return_ok = lambda self, *args, **kwargs: False
    netscape = True
    rfc2965 = hide_cookie2 = False


headers = {'User-Agent': uagen()}

print(headers)

s = requests.Session()
s.cookies.set_policy(BlockAll)
cookies = {'SetCurrency': 'USD'}
sp = proxit()
for i in range(100000000000):
    while True:
        try:
            print('trying on ', sp)
            print('with user agent headers', headers)
            s.proxies = {"http": sp}
            r = s.get("http://www.cheapoair.com", headers=headers, timeout=15, cookies=cookies)
            print(i, sp, 'success')
            print("--- %s seconds ---" % (time.time() - start_time))
        except (SocketError, requests.ConnectionError, requests.Timeout) as e:
            # All three failures are handled the same way:
            # pick a new proxy and a new user agent, then retry.
            print('passing ', sp)
            sp = proxit()
            headers = {'User-Agent': uagen()}
            print('this is the new proxy ', sp)
            print('this is the new headers ', headers)
            continue
        except KeyboardInterrupt:
            print("The program has been terminated")
            sys.exit(1)
        break

#print(r.text)
print('all done',
      '\n')

What I am looking for is a way to start with a normal request (not through a proxy) and, if it ends in an error (such as being rejected by the server), switch to a proxy and try again.

I can almost picture it, but can't quite see it.

I'm thinking that it might work if I place a variable after

for i in range(1000000000000):

but before while True: that updates sp. Another possibility is to declare s.proxies = {"http": ""} and then, if I run into an error, switch to s.proxies = {"http": proxit()} or s.proxies = {"http": sp}.

Thanks!

CENTURION

1 Answer


I figured it out.

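# Set the proxy before "while True" instead of inside the try block.
# An empty dict means the first attempt is made without a proxy.
s.proxies = {}
while True:
    try:
        r = s.get("http://www.cheapoair.com", headers=headers, timeout=15, cookies=cookies)
        break  # success, stop retrying
    except (SocketError, requests.ConnectionError, requests.Timeout) as e:
        # switch the user agent string and the proxy, then try again
        headers = {'User-Agent': uagen()}
        s.proxies = {"http": proxit()}
        continue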

That refreshes the proxy whenever a request fails with an error.
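The same idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: fetch_with_fallback, next_proxy, next_user_agent and max_attempts are hypothetical names (in the question's code, proxit and uagen would be passed in), and the session is expected to behave like requests.Session. Two things worth noting: requests does not raise an exception for a 403 or 429 response on its own, so the sketch calls raise_for_status() to treat those as failures too, and requests.RequestException is a subclass of OSError, which is why a single except OSError clause covers both requests errors and raw socket errors.

```python
def fetch_with_fallback(session, url, next_proxy, next_user_agent,
                        max_attempts=5, timeout=15):
    """Try a direct request first; after each failure retry via a new proxy.

    session         -- a requests.Session (or anything with the same interface)
    next_proxy      -- hypothetical callable returning a proxy URL, e.g. proxit
    next_user_agent -- hypothetical callable returning a UA string, e.g. uagen
    """
    session.proxies = {}  # first attempt: no proxy configured on the session
    headers = {"User-Agent": next_user_agent()}
    last_error = None
    for _ in range(max_attempts):
        try:
            response = session.get(url, headers=headers, timeout=timeout)
            # Turn 4xx/5xx answers (e.g. 403 when blocked) into an exception.
            response.raise_for_status()
            return response
        except OSError as exc:
            # requests.RequestException and socket errors are both OSError.
            last_error = exc
            proxy = next_proxy()
            session.proxies = {"http": proxy, "https": proxy}
            headers = {"User-Agent": next_user_agent()}
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Possible usage with the names from the question (not executed here):
# r = fetch_with_fallback(requests.Session(), "http://www.cheapoair.com", proxit, uagen)
```

Capping the number of attempts avoids looping forever when every proxy in the list is dead or blocked.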

CENTURION