
First, this is not a duplicate question; I have already checked almost all of the 503 / robot-index questions and none of them solved my problem. I am trying to get the giveaway list from indiegala.com, but the site has some kind of anti-bot protection. My purpose is not illegal: I just want to fetch the giveaway list and then check whether the games have Steam trading cards. But right now, indiegala serves me a robot-index page. Currently I am using this code:

    // Fetch the page with a desktop Chrome User-Agent and print the raw HTML
    String url = "https://www.indiegala.com/giveaways";
    try {
        String content = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
                .ignoreHttpErrors(true)   // don't throw on 4xx/5xx responses
                .followRedirects(true)
                .get()
                .html();
        System.out.println(content);
    } catch (IOException ex) {
        System.out.println(ex.toString());
    }

To see the output (the site's source, stored in the variable "content"), you can run the code above; I cannot paste the full output here because it is a bit long, but it looks like this:

    <head>
     <meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
    </head>
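
For reference, this is the quick check I use to tell the robot page apart from the real content; it simply assumes the block page always carries that ROBOTS meta tag (reusing the content string from the code above):

    // Assumes the block page always includes the NOINDEX, NOFOLLOW meta tag
    Document doc = Jsoup.parse(content);
    boolean blocked = !doc.select("meta[name=ROBOTS][content~=(?i)NOINDEX]").isEmpty();
    System.out.println(blocked ? "got the robot-index page" : "got the real content");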

So how can I get past this protection? Can my program pretend to be a human to pass it?

2 Answers


I've had a look at your case and worked out how to bypass the robot detection.

What you need is a cookie. See the code below:

    String url = "https://www.indiegala.com/giveaways";

    Document doc = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
            .header("cookie", "incap_ses_436_255598=zI1vN7X6+BY84PhGvPsMBjKChVcAAAAAVhJ+1//uCecPhV2QjUMw6w==")
            .timeout(0)
            .get();

This looks like a particular cookie that the website requires, and adding it to the header got me the actual website content :)

NOTE: Generally, if you encounter situations like this, you can use the Chrome developer tools to inspect the request sent by Chrome, then replicate it in your Jsoup request, as sketched below :)
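
For instance, a request that replays what you copy out of the DevTools Network tab might look like the sketch below; every header value here is a placeholder, not a working one:

    // Replays headers copied from Chrome DevTools -> Network -> Request Headers.
    // All values are placeholders; substitute the ones from your own session.
    Document doc = Jsoup.connect("https://www.indiegala.com/giveaways")
            .userAgent("<your browser's User-Agent string>")
            .header("cookie", "<the Cookie request header, copied verbatim>")
            .header("accept-language", "en-US,en;q=0.8")
            .referrer("https://www.indiegala.com/")
            .timeout(0)
            .get();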

Joel Min
  • Thanks for your answer :) when I try this code, it gives me "403 HTTP error fetching URL. Status=403". Then I tried adding ignoreHttpErrors(true); the code runs without errors, but it still gives me the robot index :/ I think I need my own cookie :) but I don't know how to get one for myself :) – david_caruso Jul 13 '16 at 03:59
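
To get a cookie of your own, the standard Jsoup pattern (untested against this site's protection, so treat it as a sketch) is to fire an initial request, capture whatever cookies the server sets, and replay them on the follow-up request:

    // Sketch only, untested against this site: make a first request, keep the
    // cookies the server sets, and send them back with the real request.
    // Needs the org.jsoup.Connection and java.util.Map imports.
    Connection.Response first = Jsoup.connect("https://www.indiegala.com/giveaways")
            .userAgent("Mozilla/5.0 ...")            // same UA you plan to reuse
            .method(Connection.Method.GET)
            .execute();
    Map<String, String> cookies = first.cookies();   // cookies set by the server

    Document doc = Jsoup.connect("https://www.indiegala.com/giveaways")
            .userAgent("Mozilla/5.0 ...")
            .cookies(cookies)                        // replay them
            .get();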

This was my case; it might help. The robot detector flagged my browser agent and showed the well-known "please prove you are not a robot" captcha. First, using a Chrome plugin that shows the headers passed to the website, I found out which cookies and userAgent were being sent. I copied the cookie and userAgent it showed into my code, and whenever the robot check is triggered I manually solve the captcha in my regular browser.

    Document doc = Jsoup.connect(urlString)
            .userAgent("Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")
            .header("cookie", "AWSALB=7ygHW4oBnXOkLMVFehmoTM8F1lLfDiTJVVeP5DTIw4dpGgQ4o2F5mYYm4bvCkJul1nkWqAjq9s0pKojKFqdP7wRm/NX/Ye2ntYKwtlOhVvA4dwSM8QTn1uwi4jgI; Expires=Fri, 24 Nov 2017 11:37:10 GMT; Path=/")
            .timeout(0)
            .get();
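
Note that the AWSALB cookie above carries an explicit Expires attribute (24 Nov 2017 in this example), so a copied cookie stops working after that date and has to be re-captured from the browser.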
zacheusz