
This is the API provided by luminati.io, a premium proxy provider. However, it returns a byte string instead of a dictionary, so it is converted into a dictionary in order to extract the IP and port:

Every request will end up with a new peer proxy because the IPs rotate for every request.

#!/usr/bin/env python
import csv
import json
import sys
import time

import requests

print('If you get the error "ImportError: No module named \'six\'", '
      'install six:\n$ sudo pip install six')

if sys.version_info[0] == 2:
    import six
    from six.moves.urllib import request
    opener = request.build_opener(
        request.ProxyHandler(
            {'http': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005'}))
    proxy_details = opener.open('http://lumtest.com/myip.json').read()
if sys.version_info[0] == 3:
    import urllib.request
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler(
            {'http': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005'}))
    proxy_details = opener.open('http://lumtest.com/myip.json').read()

proxy_dictionary = json.loads(proxy_details)
print(proxy_dictionary)
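The byte-string-to-dictionary step can be checked offline with a stand-in payload (the values below are copied from the sample output later in the question, not a live response):

```python
import json

# Stand-in for opener.open('http://lumtest.com/myip.json').read(),
# which returns raw bytes rather than a dict
raw_bytes = b'{"ip": "84.22.151.191", "country": "RU", "asn": {"asnum": 57129, "org_name": "Optibit LLC"}}'

# json.loads converts the decoded JSON text into a Python dictionary
proxy_dictionary = json.loads(raw_bytes.decode('utf-8'))

print(proxy_dictionary["ip"])            # 84.22.151.191
print(proxy_dictionary["asn"]["asnum"])  # 57129
```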

Then I plan to use the IP and port with the requests module to connect to the website of interest:

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0'}

if __name__ == "__main__":

    search_keyword = input("Enter the search keyword: ")
    page_number = int(input("Enter total number of pages: "))

    for i in range(1, page_number + 1):
        time.sleep(10)

        link = 'https://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page=' + str(i) + '&q=' + str(search_keyword) + '&spm=a2o4l.home.search.go.239e6ef06RRqVD'
        proxy = proxy_dictionary["ip"] + ':' + str(proxy_dictionary["asn"]["asnum"])
        print(proxy)
        req = requests.get(link, headers=headers, proxies={"https": proxy})

But my issue is that the requests call errors out. When I change proxies={"https":proxy} to proxies={"http":proxy}, there was one time it went through, but other than that, the proxy fails to connect.

Sample output:

proxy_dictionary = {'ip': '84.22.151.191', 'country': 'RU', 'asn': {'asnum': 57129, 'org_name': 'Optibit LLC'}, 'geo': {'city': 'Krasnoyarsk', 'region': 'KYA', 'postal_code': '660000', 'latitude': 56.0097, 'longitude': 92.7917, 'tz': 'Asia/Krasnoyarsk'}}
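Building the proxy string from this sample dictionary shows exactly what gets passed to requests:

```python
# Sample dictionary copied from the output above (trimmed to the used keys)
proxy_dictionary = {'ip': '84.22.151.191', 'country': 'RU',
                    'asn': {'asnum': 57129, 'org_name': 'Optibit LLC'}}

# Same expression as in the main loop
proxy = proxy_dictionary["ip"] + ':' + str(proxy_dictionary["asn"]["asnum"])
print(proxy)  # 84.22.151.191:57129
```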

The details of the peer proxy are shown in the image below: [Peer proxy]

`print(proxy)` will yield `84.22.151.191:57129`, which is fed into the `requests.get` method.

The Error I get:

(Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000282DDD592B0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',)))

I tested removing the proxies={"https":proxy} argument from the requests call, and the scraping works with no error. So either the proxy has an issue, or the way I access it does.

Pherdindy
  • Hmm, can't test, getting "*urllib.error.URLError: *". What is "*@127.0.3.1:20005*" part at the end of the proxy? Are trying to setup one locally? – CristiFati Jan 22 '19 at 23:24
  • Are you using your *ISP*'s *ASN* number as the proxy port? You mentioned *port* but only in comments. Also shouldn't the proxy value also contain the protocol? e.g.: *http: //84.22.151.191:57129*? – CristiFati Jan 22 '19 at 23:51
  • @CristiFati `@127.0.3.1:20005` is what my application uses to connect to `Luminati Proxy Manager`, which then returns a `peer proxy`, `84.22.151.191:57129`, which I'll then use to connect to scrape the site of interest. Since I defined `proxy = proxy_dictionary["ip"] + ':' + str(proxy_dictionary["asn"]["asnum"])`, then `proxies={"https":proxy}` is `proxies={"https":"84.22.151.191:57129"}`. Did you mean it has to be `proxies={"https":"https://84.22.151.191:57129"}`? – Pherdindy Jan 23 '19 at 12:05
  • Note you will not be able to connect using this`'http': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005'` because I changed the details as it is my username and password to the service – Pherdindy Jan 23 '19 at 12:07
  • I also tried the format `proxies={"https":"https://84.22.151.191:57129"}` and the same error occurs. – Pherdindy Jan 24 '19 at 09:15

2 Answers


When changing proxies={"https":proxy} to proxies={"http":proxy}, you also have to make sure your link is http and not https, so also try replacing:

link = 'https://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page='+str(i)+'&q='+str(search_keyword)+'&spm=a2o4l.home.search.go.239e6ef06RRqVD'

with

link = 'http://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page='+str(i)+'&q='+str(search_keyword)+'&spm=a2o4l.home.search.go.239e6ef06RRqVD'

Your overall code should look like this:

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0'}

if __name__ == "__main__":

    search_keyword = input("Enter the search keyword: ")
    page_number = int(input("Enter total number of pages: "))

    for i in range(1, page_number + 1):
        time.sleep(10)

        link = 'http://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page=' + str(i) + '&q=' + str(search_keyword) + '&spm=a2o4l.home.search.go.239e6ef06RRqVD'
        proxy = proxy_dictionary["ip"] + ':' + str(proxy_dictionary["asn"]["asnum"])
        print(proxy)
        req = requests.get(link, headers=headers, proxies={"http": proxy})
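To see why the scheme has to match: requests looks up the proxy entry by the URL's scheme, so a mapping keyed "http" is simply ignored for an https:// URL. A simplified sketch of that lookup (not requests' actual implementation):

```python
from urllib.parse import urlparse

proxies = {"http": "84.22.151.191:57129"}

def proxy_for(url, proxies):
    # Pick the proxy entry whose key matches the URL's scheme,
    # mimicking how requests selects a proxy for a given URL
    return proxies.get(urlparse(url).scheme)

print(proxy_for("http://www.example.com/", proxies))   # 84.22.151.191:57129
print(proxy_for("https://www.example.com/", proxies))  # None -> direct connection
```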

Hope this helps!

Nazim Kerimbekov
  • I have actually tried doing that prior to posting, thinking they had to match, but it didn't work either – Pherdindy Jan 23 '19 at 12:11

A little late to the party, but this is what worked for me.

proxies = {
    'http': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005',
    'https': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005',
}

req = requests.get(link, headers=headers, proxies=proxies)

After defining the proxies like this, I was able to hit the link and get a response. I believe Luminati requires the credentials for rotating and hitting the links through their proxies.
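One way to build that mapping once from the credential pieces (the username, password, and proxy-manager address below are the placeholder values from the question, not working credentials):

```python
# Placeholder credentials copied from the question; substitute your own
username = 'lum-customer-hl_1247574f-zone-static'
password = 'lnheclanmc'
manager = '127.0.3.1:20005'

# Both http and https traffic are routed through the same credentialed URL
proxy_url = 'http://{}:{}@{}'.format(username, password, manager)
proxies = {'http': proxy_url, 'https': proxy_url}

print(proxies['https'])
```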