
I'm using the following infrastructure to scrape a website:

Scrapy <--> Splash <--> Scrapoxy <--> web site

I send requests through Splash's execute endpoint, with a Lua script like this:

function main(splash)
    local host = "..."
    local port = "..."
    local username = "..."
    local password = "..."

    splash:on_request(function (request)
        request:set_proxy{host, port, username=username, password=password}
    end)

    splash:go(splash.args.url)
    return splash:html()
end

I want to detect bans and remove banned proxies. According to the Scrapoxy documentation:

Scrapoxy adds to the response an HTTP header x-cache-proxyname

But I don't see this header in response.headers. The only headers are:

{b'Content-Type': b'text/html; charset=utf-8',
 b'Date': b'Wed, 18 Apr 2018 19:02:21 GMT',
 b'Server': b'TwistedWeb/16.1.1'}

What am I doing wrong? Should I add something to the Lua script to properly return headers?


UPDATE: Actually, this doesn't seem to be a Splash problem. Scrapoxy doesn't return x-cache-proxyname even when I go through it with HTTPie:

http -v --proxy=https:http://<user>:<password>@<scrapoxy-server>:8888 https://<site>

GET / HTTP/1.1
User-Agent: HTTPie/0.9.9
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Host: <site>


HTTP/1.1 200 OK
Server: nginx
Date: Thu, 28 Jun 2018 08:14:26 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: <...>
X-Powered-By: Express
ETag: W/"5a31b-faPJ7bjKH24S/3EvHU/8IoJHyxw"
Vary: Cookie, User-Agent
Content-Security-Policy: default-src https:; child-src https:; connect-src https: wss:; form-action https:; frame-ancestors https: http://webvisor.com; media-src https:; object-src https:; img-src https: data: blob:; script-src https: data: 'unsafe-inline' 'unsafe-eval'; style-src https: 'unsafe-inline'; font-src https: data:; report-uri /ajax/csp-report/
Content-Encoding: gzip
Gallaecio
alexanderlukanin13

1 Answer


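I managed to get x-cache-proxyname with the Lua script below. It captures the header in an on_response_headers callback and copies it onto Splash's own response with splash:set_result_header: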

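function main(splash)
    local host = "..."
    local port = "..."
    local username = "..."
    local password = "..."
    local proxy = ""

    -- Route every request through Scrapoxy.
    splash:on_request(function (request)
        request:set_proxy{host, port, username=username, password=password}
    end)

    -- Remember the proxy name whenever a response carries the header.
    splash:on_response_headers(function (response)
        local name = response.headers["x-cache-proxyname"]
        if name then
            proxy = name
        end
    end)

    splash.images_enabled = false
    -- Load the page once; a second splash:go could pass through a different proxy.
    splash:go(splash.args.url)
    -- Copy the captured name onto the response that Splash returns to Scrapy.
    splash:set_result_header("x-cache-proxyname", proxy)
    return splash:html()
end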
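
On the Scrapy side, the forwarded header can then be read from the response and used for ban detection. The following is only a minimal sketch: `proxy_name`, `banned_proxy` and the `BAN_STATUSES` set are hypothetical names and rules that are not part of Scrapy, Splash or Scrapoxy, and it assumes the headers arrive as a plain mapping of bytes to bytes, like the dict shown in the question.

```python
from typing import Mapping, Optional

# Header that the Lua script copies onto Splash's response.
PROXY_HEADER = b"x-cache-proxyname"

# Assumption: which status codes count as a ban depends on the target site.
BAN_STATUSES = {403, 429, 503}


def proxy_name(headers: Mapping[bytes, bytes]) -> Optional[str]:
    """Return the Scrapoxy instance name from the headers, or None if absent."""
    for key, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if key.lower() == PROXY_HEADER:
            # An empty value means the Lua script never saw the header.
            return value.decode("ascii", errors="replace") or None
    return None


def banned_proxy(status: int, headers: Mapping[bytes, bytes]) -> Optional[str]:
    """Return the name of the proxy to remove, or None if there is nothing to do."""
    if status in BAN_STATUSES:
        return proxy_name(headers)
    return None
```

In a spider callback this might be called as `banned_proxy(response.status, headers)`, where `headers` is built from `response.headers`. What to do with the returned name (for example, asking Scrapoxy to stop that instance) is left out here.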

UPDATE: When you use HTTPS, Scrapoxy cannot edit headers, so it cannot add x-cache-proxyname to the response. The traffic passes through an encrypted CONNECT tunnel that the proxy only relays.

Ilia