
I'm testing making multiple requests to various URLs for some web scraping, and after the first request, the second often fails. I can't figure out why.

I make two simple requests to different sites, and the second request fails with a response that appears to come from Google (a Google 404 page) rather than from the site I asked for. If I start the server and hit only Yahoo, the request returns as expected. The same behavior happens if my first request hits Wikipedia and subsequent requests go somewhere else.

Can someone explain what's happening?

Thanks.

deps: {:httpoison, "~> 1.5"}

First, I start the server (as per the docs):

iex(1)> HTTPoison.start
{:ok, []}

Next, I make a request to get Google's homepage:

iex(2)> HTTPoison.get "https://www.google.com"
{:ok,
 %HTTPoison.Response{
   body: "<!doctype html><html itemscope=\"\" itemtype=\"http://schema.org/WebPage\" lang=\"en\"><head><meta content=\"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.\" name=\"description\"><meta content=\"noodp\" name=\"robots\"><meta content=\"text/html; charset=UTF-8\" http-equiv=\"Content-Type\"><meta content=\"/logos/doodles/2019/celebrating-earl-scruggs-5680695065182208.3-law.gif\" itemprop=\"image\"><meta content=\"Celebrating Earl Scruggs\" property=\"twitter:title\"><meta content=\"Celebrating Earl Scruggs! #GoogleDoodle\" property=\"twitter:description\"><meta content=\"Celebrating Earl Scruggs! #GoogleDoodle\" property=\"og:description\"><meta content=\"summary_large_image\" property=\"twitter:card\"><meta content=\"@GoogleDoodles\" property=\"twitter:site\"><meta content=\"https://www.google.com/logos/doodles/2019/celebrating-earl-scruggs-5680695065182208-2xa.gif\" property=\"twitter:image\"><meta content=\"https://www.google.com/logos/doodles/2019/celebrating-earl-scruggs-5680695065182208-2xa.gif\" property=\"og:image\"><meta content=\"1000\" property=\"og:image:width\"><meta content=\"400\" property=\"og:image:height\"><meta content=\"https://www.google.com/logos/doodles/2019/celebrating-earl-scruggs-5680695065182208-2xa.gif\" property=\"og:url\"><meta content=\"video.other\" property=\"og:type\"><title>Google</title><script 
nonce=\"j0aPHCuRPUlftRzX2g6tTQ==\">(function(){window.google={kEI:'D1A4XKnVOo-6_wTgtpOgDA',kEXPI:'0,1353747,57,50,1907,1017,625,781,698,527,731,325,1124,349,30,1227,806,95,546,352,2335328,167,32,68,329226,1294,12383,4855,32692,2074,13173,867,10761,1402,6381,854,2481,2,2,6801,364,1165,7,2147,1262,4243,224,1017,1195,266,3742,1365,575,835,284,2,579,727,2069,363,58,2,1,3,933,364,4324,3397,302,658,610,291,482,2115,135,1407,1413,1529,395,525,621,5,2,2,1963,528,2067,182,283,2838,298,670,1044,1,468,1344,386,743,268,81,7,1,2,27,461,620,29,983,6,406,458,466,2,1379,769,536,428,267,2552,1739,313,876,412,2,554,2368,2,264,381,286,948,11,1209,38,363,557,270,303,145,155,499,285,433,42,1322,99,342,43,47,1080,543,1826,367,789,270,603,661,431,49,626,265,217,779,1531,35,2,4,2,670,44,226,1292,3,237,9,12,408,349,167,82,247,879,238,410,529,187,508,105,1,1496,5,12,620,464,87,99,25,178,283,278,6,38,53,290,390,37,117,9,81,345,103,17,112,7,203,173,81,2,83,340,14,617,604,58,351,614,175,97,1,1,2,177,803,60,264,88,5968727,2554,233,22,5997346,90,2800095,4,1572,549,332,445,1,2,80,1,900,583,4,309,1,8,1,2,2132,1,1,1,1,1,414,1,748,141,59,726,3,7,443,3,117,1,2,140,226,23,53,22306694',authuser:0,kscs:'c9c918f0_D1A4XKnVOo-6_wTgtpOgDA',kGL:'US'};google.kHL='en';})();google.time=function(){return(new Date).getTime()};(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute(\"eid\")));)a=a.parentNode;return b||google.kEI};google.getLEI=function(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute(\"leid\")));)a=a.parentNode;return b};google.https=function(){return\"https:\"==window.location.protocol};google.ml=function(){return null};google.log=function(a,b,e,c,g){if(a=google.logUrl(a,b,e,c,g)){b=new Image;var d=google.lc,f=google.li;d[f]=b;b.onerror=b.onload=b.onabort=function(){delete d[f]};google.vel&&google.vel.lu&&google.vel.lu(a);b.src=a;google.li=f+1}};google.logUrl=function(a,b,e,c,g){var 
d=\"\",f=google.ls||\"\";e||-1!=b.search(\"&ei=\")||(d=\"&ei=\"+google.getEI(c),-1==b.search(\"&lei=\")&&(c=google.getLEI(c))&&(d+=\"&lei=\"+c));c=\"\";!e&&google.cshid&&-1==b.search(\"&cshid=\")&&\"slh\"!=a&&(c=\"&cshid=\"+google.cshid);a=e||\"/\"+(g||\"gen_204\")+\"?atyp=i&ct=\"+a+\"&cad=\"+b+d+f+\"&zx=\"+google.time()+c;/^http:/i.test(a)&&google.https()&&(google.ml(Error(\"a\"),!1,{src:a,glmm:1}),a=\"\");return a};}).call(this);(function(){google.y={};google.x=function(a,b){if(a)var c=a.id;else{do c=Math.random();while(google.y[c])}google.y[c]=[a,b];return!1};google.lm=[];google.plm=function(a){google.lm.push.apply(google.lm,a)};google.lq=[];google.load=function(a,b,c){google.lq.push([[a],b,c])};google.loadAll=function(a,b){google.lq.push([a,b])};}).call(this);google.f={};</scri" <> ...,
   headers: [
     {"Date", "Fri, 11 Jan 2019 08:13:03 GMT"},
     {"Expires", "-1"},
     {"Cache-Control", "private, max-age=0"},
     {"Content-Type", "text/html; charset=ISO-8859-1"},
     {"P3P", "CP=\"This is not a P3P policy! See g.co/p3phelp for more info.\""},
     {"Server", "gws"},
     {"X-XSS-Protection", "1; mode=block"},
     {"X-Frame-Options", "SAMEORIGIN"},
     {"Set-Cookie",
      "1P_JAR=2019-01-11-08; expires=Sun, 10-Feb-2019 08:13:03 GMT; path=/; domain=.google.com"},
     {"Set-Cookie",
      "NID=154=eRdDgOkW7gEdW7vRAPVM1Q7p3GKbBPOSH3yr07CL414Lmx740Jtk9WTPtl9RbGzWJ4QCetWtoQIjSbv_F-ML6Bs6_I9tt91ED_TD8ZKQrenqMr9ykhB7oBd8XoN7W5TqWNTy5jdlEjPFjwkAL42qTrjgGR2MJ5_jTphwwzVCKS8; expires=Sat, 13-Jul-2019 08:13:03 GMT; path=/; domain=.google.com; HttpOnly"},
     {"Alt-Svc", "quic=\":443\"; ma=2592000; v=\"44,43,39,35\""},
     {"Accept-Ranges", "none"},
     {"Vary", "Accept-Encoding"},
     {"Transfer-Encoding", "chunked"}
   ],
   request: %HTTPoison.Request{
     body: "",
     headers: [],
     method: :get,
     options: [],
     params: %{},
     url: "https://www.google.com"
   },
   request_url: "https://www.google.com",
   status_code: 200
 }}

Lastly, I make a request to get Yahoo's homepage, but I get back a 404 page served by Google:

iex(3)> HTTPoison.get "https://www.yahoo.com"
{:ok,
 %HTTPoison.Response{
   body: "<!DOCTYPE html>\n<html lang=en>\n  <meta charset=utf-8>\n  <meta name=viewport content=\"initial-scale=1, minimum-scale=1, width=device-width\">\n  <title>Error 404 (Not Found)!!1</title>\n  <style>\n    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}\n  </style>\n  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>\n  <p><b>404.</b> <ins>That’s an error.</ins>\n  <p>The requested URL <code>/</code> was not found on this server.  <ins>That’s all we know.</ins>\n",
   headers: [
     {"Content-Type", "text/html; charset=UTF-8"},
     {"Referrer-Policy", "no-referrer"},
     {"Content-Length", "1561"},
     {"Date", "Fri, 11 Jan 2019 08:13:27 GMT"},
     {"Alt-Svc", "quic=\":443\"; ma=2592000; v=\"44,43,39,35\""}
   ],
   request: %HTTPoison.Request{
     body: "",
     headers: [],
     method: :get,
     options: [],
     params: %{},
     url: "https://www.yahoo.com"
   },
   request_url: "https://www.yahoo.com",
   status_code: 404
 }}
Brian
  • I have just tried copying your requests and it's working fine; I am getting 200 for each of them :/ – Tano Jan 11 '19 at 13:52
  • I also tried, and I'm getting HTTP 200 for both requests. Can you see the Yahoo page in a browser? Did you change your hosts file? – Milan Jaric Jan 11 '19 at 13:57
  • I have tried it also, copying and pasting both requests into my iex shell. I get 200 and 302 status codes (redirected). Have you tried curl? Maybe your IP is blacklisted, and maybe this is their security configuration to prevent web scraping? – Hendri Tobing Jan 12 '19 at 20:04
  • I've tested on a few networks, and it seems to happen only behind the corporate firewall. It's really strange behavior in any case, so I'll need to run more tests, but suffice it to say for now that it's proxy / firewall related. – Brian Jan 14 '19 at 08:31
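
One way to narrow this down further might be to repeat the second request with hackney's connection pool turned off, so that it cannot reuse a connection opened by the first request, and then check whether the body still looks like a Google page. The sketch below is only a diagnostic aid under that assumption: `ProxyCheck`, `fetch_without_pool/1` and `google_body?/1` are hypothetical names, not part of HTTPoison, and the thread does not confirm that pooling is the cause.

```elixir
defmodule ProxyCheck do
  # Hypothetical helper module for diagnosis; not part of HTTPoison.

  # Fetch a URL with hackney's connection pool disabled, so the request
  # opens a fresh connection instead of reusing a pooled one.
  def fetch_without_pool(url) do
    HTTPoison.get(url, [], hackney: [pool: false])
  end

  # Crude heuristic: does the response body reference google.com?
  # The 404 page in the question does (its logo and link URLs).
  def google_body?(body) when is_binary(body) do
    String.contains?(body, "google.com")
  end
end
```

In iex, something like `{:ok, resp} = ProxyCheck.fetch_without_pool("https://www.yahoo.com")` followed by `ProxyCheck.google_body?(resp.body)` would show whether the Yahoo request still comes back with Google content. If it does, even on a fresh connection, that points further toward the proxy or firewall than toward the client.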
