
I am using scrapy, splash, and scrapy_splash to scrape a catalog website.

The website uses a form POST to open a new item details page.

Sometimes the item detail page displays a default error page (unrelated to the HTTP status) in Splash; however, if I repost the form a second time, the item details are returned. I am still investigating the root cause. It seems more like a timing issue than a deliberate check after n requests.

As a workaround, I am using the splash:on_response method to retry the form post when the error page is received.

I would like to be able to log the failed attempts for later manual processing. Is there a best practice or standard approach to collecting this information?

function main(splash)
    -- restore the session cookies forwarded from Scrapy
    if splash.args.cookies then
        splash:init_cookies(splash.args.cookies)
    end

    -- issue the request described by splash.args; formdata applies only
    -- to form POSTs
    function web_request()
        if splash.args.http_method == 'GET' then
            assert(splash:go{
                url=splash.args.url,
                headers=splash.args.headers,
                http_method=splash.args.http_method,
                body=splash.args.body,
            })
        else
            assert(splash:go{
                url=splash.args.url,
                headers=splash.args.headers,
                http_method=splash.args.http_method,
                body=splash.args.body,
                formdata=splash.args.formdata,
            })
        end
    end

    --- AREA OF THE CODE UNDER QUESTION
    -- retry the form post whenever the site's default error page comes back
    local retry_max = 3
    local retry_count = 0
    splash:on_response(function (response)
        if string.find(response.url, 'error_check.html') ~= nil then
            if retry_count < retry_max then
                retry_count = retry_count + 1
                web_request()
            else
                --- Not sure how to capture this in the item pipeline
                --- Also, I would like to capture the form post details
                --- such as the form data and headers
                error('Max retry exceeded: ' .. response.url)
            end
        end
    end)

    web_request()
    assert(splash:wait(0.5))

    -- use the last history entry to report the final status and headers
    local entries = splash:history()
    local last_response = entries[#entries].response

    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
        har = splash:har(),
        retry_count = retry_count
    }
end
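
One approach I am considering is to accumulate the failure details in a local table and include that table in the script's return value, since scrapy_splash exposes the returned table as response.data in the spider callback. This is only a rough sketch; failed_attempts is an illustrative name, not a Splash or scrapy_splash API, and the cookie setup and web_request() function are unchanged from the script above.

function main(splash)
    local failed_attempts = {}

    -- ... cookie setup and web_request() as in the script above ...

    splash:on_response(function (response)
        if string.find(response.url, 'error_check.html') ~= nil then
            -- record what we know about the failed attempt, including the
            -- original form post details passed in via splash.args
            table.insert(failed_attempts, {
                url = response.url,
                status = response.status,
                formdata = splash.args.formdata,
                headers = splash.args.headers,
            })
            -- ... retry logic as above ...
        end
    end)

    web_request()
    assert(splash:wait(0.5))

    return {
        html = splash:html(),
        failed_attempts = failed_attempts,
    }
end

In the spider callback, response.data['failed_attempts'] could then be logged or attached to the item so the pipeline can flag it for manual follow-up.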
edited by Gallaecio
asked by Charles Green
  • Why do you need splash:on_response? Why doesn't running `web_request()` in a `while` loop work? – Mikhail Korobov Jul 01 '17 at 02:25
  • @MikhailKorobov great question. I didn't try that. I will now. Thank you for your reply. – Charles Green Jul 04 '17 at 02:37
  • @MikhailKorobov actually I read your response again. The splash:on_response call works. My issue is with capturing the details of the failed request so that I can add a scripted check and manually follow up on failed attempts to confirm whether the product details are available. Sometimes the site I am scraping displays an error page; I think this is an attempt to block automated scraping. – Charles Green Jul 04 '17 at 02:49
  • This makes sense, but I'm trying to figure out why isn't `splash:go` enough. If it is unsuccessful, i.e. assert(splash:go(...)) fails, you can use `ok, reason = splash:go()` to avoid raising an error, then call `splash:html()` (or do scraping with JavaScript or `splash:select(...)`) to get the response and inspect it if `ok` is nil (sketched after these comments). – Mikhail Korobov Jul 04 '17 at 08:44
  • @MikhailKorobov many thanks for your help. The issue ended up being an error outside of this function. The above code, without the retry, works as expected. – Charles Green Jul 06 '17 at 03:55
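
For reference, the pattern Mikhail describes (the non-asserting form of splash:go inside a loop, inspecting the rendered page instead of registering an on_response callback) might look roughly like this. It is only a sketch; the retry limit and the error_check.html test are carried over from the question, and the GET/POST branching is omitted for brevity.

function main(splash)
    local retry_max = 3
    local attempts = 0
    local ok, reason

    repeat
        attempts = attempts + 1
        -- non-asserting form: ok is nil and reason describes the error
        -- when the request itself fails
        ok, reason = splash:go{
            url=splash.args.url,
            headers=splash.args.headers,
            http_method=splash.args.http_method,
            body=splash.args.body,
            formdata=splash.args.formdata,
        }
        assert(splash:wait(0.5))
        -- treat the site's default error page as a failure even when
        -- splash:go itself reported success
        if ok and string.find(splash:url(), 'error_check.html') == nil then
            break
        end
    until attempts >= retry_max

    return {
        url = splash:url(),
        html = splash:html(),
        attempts = attempts,
        last_error = reason,
    }
end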

0 Answers