
I am scraping an e-commerce website with Selenium, because the pages are rendered by JavaScript.

Here's the workflow:

1. Instantiate a web driver in virtual display mode, sending a random user agent. Using a random user agent slightly decreases your chances of detection, but it does not reduce the chances of being blocked by IP.
2. For each query term, say "pajamas", build the search URL for that website and open it.
3. Get the corresponding text elements via XPath, e.g. the top 10 product IDs, their prices, product titles, etc.
4. Store them in a file for further processing.
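Step 2 can be sketched as a small helper. This is a hypothetical example: `SEARCH_TEMPLATE` and `build_search_url` are my own names, and the real site's URL pattern will differ.

```python
from urllib.parse import urlencode

# Placeholder search URL pattern; substitute the target site's real template.
SEARCH_TEMPLATE = "https://www.example.com/search?{}"

def build_search_url(query):
    # urlencode escapes spaces and special characters in the query term
    return SEARCH_TEMPLATE.format(urlencode({"q": query}))

build_search_url("pajamas")  # https://www.example.com/search?q=pajamas
```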

I have upwards of 38,000 such URLs whose elements I need to fetch on page load. I used multiprocessing, and I quickly realized the process was failing: after a while the website blocked us, so the page loads stopped succeeding.

How can I spoof my IP in Python, and will it work when Selenium is driving the web for you, rather than urllib/urlopen?

Aside from the actual fetches via the XPaths, here's the basic code - more specifically, see init_driver:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import argparse
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import codecs, urllib, os
import multiprocessing as mp
from pyvirtualdisplay import Display  # needed for the virtual display in init_driver
from my_custom_path import scraping_conf_updated as sf
from fake_useragent import UserAgent

def set_cookies(COOKIES, exp, driver):
    for key, val in COOKIES[exp].items():
        driver.add_cookie({'name': key, 'value': val, 'path': '/', 'secure': False, 'expiry': None})
    return driver


def check_cookies(driver, exp):
    print("printing cookie name & value")
    for cookie in driver.get_cookies():
        if cookie['name'] in COOKIES[exp]:
            print(cookie['name'], "-->", cookie['value'])


def wait_for(driver):
    if conf_key['WAIT_FOR_ID'] != '':
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, conf_key['WAIT_FOR_ID'])))
    elif conf_key['WAIT_FOR_CLASS'] != '':
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, conf_key['WAIT_FOR_CLASS'])))
    return driver



def init_driver(base_url, url, exp):
    display = Display(visible=0, size=(1024, 768))
    display.start()
    profile = webdriver.FirefoxProfile()
    ua = UserAgent(cache=False)
    profile.set_preference("general.useragent.override", ua.random)
    driver = webdriver.Firefox(profile)
    if len(conf_key['COOKIES'][exp]) != 0:
        driver.get(base_url)
        driver.delete_all_cookies()
        driver = set_cookies(COOKIES, exp, driver)
        check_cookies(driver, exp)
    driver.set_page_load_timeout(300)  # set the timeout before the page load it should apply to
    driver.get(url)
    if len(conf_key['POP_UP']['XPATH']) > 0:
        driver = identify_and_close_popup(driver)
    driver = wait_for(driver)
    return driver

1 Answer


Use a VPN provider or an HTTP or SOCKS proxy to change your apparent originating IP address as seen by the target website.
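With Selenium + Firefox, a proxy can be configured through profile preferences before the driver is created. A minimal sketch, assuming a proxy is already running; `make_proxy_prefs` is my own helper name and the host/port are placeholders:

```python
def make_proxy_prefs(host, port, socks=True):
    """Build the Firefox about:config preferences for routing traffic
    through a SOCKS or HTTP proxy (hypothetical helper, not Selenium API)."""
    prefs = {"network.proxy.type": 1}  # 1 = manual proxy configuration
    if socks:
        prefs["network.proxy.socks"] = host
        prefs["network.proxy.socks_port"] = port
    else:
        prefs["network.proxy.http"] = host
        prefs["network.proxy.http_port"] = port
        prefs["network.proxy.ssl"] = host
        prefs["network.proxy.ssl_port"] = port
    return prefs

# Usage inside init_driver, before webdriver.Firefox(profile):
#   for key, val in make_proxy_prefs("127.0.0.1", 9050).items():
#       profile.set_preference(key, val)
```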

  • But all queries would still appear to originate from a single IP address and would therefore be blocked. – Hoppo Jun 06 '20 at 09:54
  • Generally there's a large number (not one) of outbound gateways from the bigger VPN service providers (ExpressVPN, for example). While they will all eventually be identified, it takes a while, and in the meantime the VPN provider changes its topology frequently. It's not perfect, but it gives you a longer window of time to operate. – Matt Sweetnam Jan 13 '21 at 23:54
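The multi-gateway idea from the comments can be sketched as a simple round-robin rotation, so that successive workers use different exit addresses. This is a hypothetical illustration: `PROXIES` is a placeholder list, and a real pool would come from a proxy provider or your own gateways.

```python
from itertools import cycle

# Placeholder pool of host:port proxy endpoints (not real servers)
PROXIES = ["10.0.0.1:1080", "10.0.0.2:1080", "10.0.0.3:1080"]
_pool = cycle(PROXIES)

def next_proxy():
    # Round-robin over the pool; wraps back to the start when exhausted
    host, port = next(_pool).split(":")
    return host, int(port)
```

Each worker process would call `next_proxy()` once at startup and feed the result into its driver's proxy preferences.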