I am trying to scrape a website using node.js and request. It was working fine, and then today I suddenly started getting errors about exceeding the maximum number of redirects. I opened developer tools, loaded the page in the browser, and saw that it performed a couple of redirects before returning the response. In node.js the redirects never terminated. Here is the page I am trying to scrape:

https://live-tennis.eu/en/atp-live-ranking

If you load it in a browser, you will see one redirect that adds a querystring parameter __r. The server then copies that value into a Set-Cookie header and redirects back to the original URL, which returns the response. When I run the same request in node.js, it doesn't stop there: it keeps redirecting until it hits the maximum (I believe the default is 10) and then errors.

So I started adding every header from the request I saw in developer tools to my request options, and as soon as I added the cookies it worked. I then googled "how to keep cookies on redirect using request in node.js" and stumbled across some posts implying that I should specify "jar: true" in my options, which tells request to store cookies in its internal cookie jar and send them on subsequent requests. I did that and it worked. So I stripped all of my other options back out and went back to what I started with, adding only the jar option, like this:

    var options = {
        url: 'https://live-tennis.eu/en/atp-live-ranking',
        port: 443,
        proxy: process.env.HTTPS_PROXY,
        jar: true,
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
            'Accept-Language': 'en-us',
            'Content-Language': 'en-us'
        },
        timeout: 0,
        encoding: null,
        rejectUnauthorized: false
    };

The call itself is just a plain request call, like this:

        request(options, function (err, resp, body) {
            if (err) reject(err);
            else resolve(body);
        });
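
The redirect behaviour described above can be modelled without any network access. The following is only a sketch under stated assumptions: `fakeServer` and `fetchWithRedirects` are hypothetical names, the cookie is assumed to reuse the `__r` name, the token value is made up, and the limit of 10 is the default I believe request uses. It is not how request or the real site is implemented; it only illustrates why dropping the cookie between hops would produce an endless 307 chain.

```javascript
// Hypothetical model of the redirect dance described in the question.
// Nothing here talks to the real site or uses the request library.
const MAX_REDIRECTS = 10; // assumed default redirect limit

// Simulated server: decides what to return for a URL and a Cookie value.
function fakeServer(url, cookie) {
  const [path, query] = url.split('?');
  if (query && query.startsWith('__r=')) {
    // Step 2: copy the query value into a cookie and bounce back.
    return { status: 307, location: path, setCookie: query };
  }
  if (cookie) {
    // Step 3: the cookie is present, so the page is served.
    return { status: 200, body: 'rankings page' };
  }
  // Step 1: no cookie yet, redirect with the __r parameter (made-up value).
  return { status: 307, location: path + '?__r=abc123' };
}

// Simulated client: follows redirects, optionally remembering cookies.
function fetchWithRedirects(url, keepCookies) {
  let cookie = '';
  for (let hops = 0; hops <= MAX_REDIRECTS; hops++) {
    const res = fakeServer(url, cookie);
    if (res.status === 200) {
      return { ok: true, redirects: hops, body: res.body };
    }
    if (keepCookies && res.setCookie) {
      cookie = res.setCookie; // what a cookie jar would do
    }
    url = res.location;
  }
  return { ok: false, error: 'too many redirects' };
}

console.log(fetchWithRedirects('/en/atp-live-ranking', true));
console.log(fetchWithRedirects('/en/atp-live-ranking', false));
```

In this model the client that keeps cookies succeeds after two redirects, while the one that discards them keeps getting sent back to step 1 until it gives up.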

I ran it locally and everything worked, so I published to Azure, where it still fails. In Application Insights on Azure I can see a chain of HTTP requests, each returning a 307 redirect, until it hits the max and errors. Now for the really odd part: since I could not get it to work on Azure, I went back to my local version and restored it exactly as it was before, without the "jar: true", and it still works. I even cleared cookies and cache in Chrome to make sure that wasn't a factor. Now I can't get it to fail locally again (which I was honestly only trying to do so I could paste the error and stack trace), but it will not run correctly on Azure.

The only way I originally got it to work locally was by setting the cookie manually in the request headers (I simply added 'cookie' to the headers and pasted in the value from dev tools), so the missing cookie had to be the reason. But why does it still work now that I have taken that out and removed jar: true, and, more importantly, why can I not get it to work on Azure at all?

Thanks in advance for any help. Chris

Chris H
  • So, if a dynamic cookie is required, you will need a cookieJar added to `request()`. It supports that. It's in the doc how to do it. And, you will need to use the same cookie jar for each request. – jfriend00 Feb 03 '20 at 03:25
  • I'm only making one request. My request evidently gets redirected multiple times, but I never get a response where I need to grab the cookies and then make subsequent requests based on that. Are you saying that I don't actually need to put anything in the cookie jar, just specify one? When I read the docs, I thought they implied that setting jar:true would use a default cookie jar, and that cookies set in a redirect would be sent on later hops. Maybe that isn't the case? Just for grins I'm going to try axios, but I'm still not really sure what to do. – Chris H Feb 03 '20 at 03:48
  • FYI, with your exact code and the `jar` enabled or disabled, I cannot reproduce the problem here in node.js v12.13.1 on Windows 10. I am, of course, not using any HTTPS_PROXY like the code shows since I don't know how to simulate what you're using there. – jfriend00 Feb 03 '20 at 03:54
  • @jfriend00 thanks for digging... as it turns out my issue was not what I thought and you pointed me in the right direction. Turns out I let an environment file accidentally get published and all of a sudden it was trying to use the HTTPS_PROXY which was of course failing on Azure. That fixed the Azure problem, but I'm still stumped as to why I was getting the max redirects error running locally that I no longer seem to be getting whether I have jar enabled or disabled. – Chris H Feb 03 '20 at 15:28

1 Answer

Thanks to @jfriend00 for pointing me in the right direction. My issue was an inadvertently published environment file that was causing the production deployment to try to use HTTPS_PROXY that I definitely did not want on Azure. That problem is now solved.
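
For anyone hitting something similar, a small startup check can make a stray proxy setting visible before it causes confusing failures. This is only a sketch: `describeProxyConfig` is a hypothetical helper, and apart from HTTPS_PROXY (the variable from my case) the variable names listed are assumed, commonly used spellings.

```javascript
// Hypothetical diagnostic: report which proxy-related environment
// variables are set, so an accidentally published env file is noticed.
function describeProxyConfig(env) {
  // HTTPS_PROXY is the variable from the question; the others are
  // assumed, commonly used spellings.
  const names = ['HTTPS_PROXY', 'https_proxy', 'HTTP_PROXY', 'http_proxy'];
  const found = names.filter(function (name) {
    return Boolean(env[name]);
  });
  if (found.length === 0) {
    return 'no proxy variables set';
  }
  return 'proxy variables set: ' + found.join(', ');
}

// Log once at startup, before the first outgoing request.
console.log(describeProxyConfig(process.env));
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the check easy to try with different inputs.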

Chris H