
First off, I want to apologize in case my question doesn't come with enough context; I'm typing this up on my phone right now.

So I'm working on a project that requires me to automate tasks within a webpage. Step one is to access the page in the first place, but I've hit an obstacle that I've tried searching and figuring out, to no avail.

The webpage I'm trying to reach has DDoS protection by Cloudflare, meaning that before you enter the page, your browser is checked for a couple of seconds and then let through.

I'm using the external library HtmlUnit, which provides everything I need, but when accessing the page I get a 503 error saying I cannot access it. I'm fairly sure this is the protection blocking me.

Now my question is: how should I bypass it? There is a .jar I decompiled and looked at which goes to the same site as mine, but it's far too illegible for me to make anything out.

I would appreciate help on this so much, thanks.

For reference, here is an example of a webpage that uses Cloudflare, for testing: www.osbot.org (this isn't the actual site, by the way).

If you need anything else, please let me know, and again, sorry for the text-only post; it's hard typing this up on my phone and I currently have no PC access.

Edit: I cannot whitelist my IP or get in contact with the site owner.

– SirRan (question edited by Ahmed Ashour)

3 Answers


I know this question is quite old, but there is no correct answer yet. Here is what works for me:

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.Cookie;

WebClient client = new WebClient(BrowserVersion.CHROME);

// JavaScript must stay enabled so the Cloudflare challenge can run.
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(true);
// Don't throw on the initial 503 that Cloudflare returns with the challenge.
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.getOptions().setRedirectEnabled(true);
client.getCache().setMaxSize(0);
client.waitForBackgroundJavaScript(10000);
client.setJavaScriptTimeout(10000);
client.waitForBackgroundJavaScriptStartingBefore(10000);

try {
    String url = "https://www.badlion.net/";

    HtmlPage page = client.getPage(url);

    // Give the challenge script time to run and set the clearance cookies.
    synchronized (page) {
        page.wait(7000);
    }

    // Print cookies for test purposes. Comment out in production.
    URL _url = new URL(url);
    for (Cookie c : client.getCookies(_url)) {
        System.out.println(c.getName() + "=" + c.getValue());
    }

    // This prints the content after bypassing Cloudflare.
    System.out.println(client.getPage(url).getWebResponse().getContentAsString());
} catch (FailingHttpStatusCodeException e) {
    e.printStackTrace();
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} catch (InterruptedException e) {
    e.printStackTrace();
}

Just replace String url = "https://www.badlion.net/"; with the URL you are attempting to access.

– Raghav

By default, HtmlUnit throws an exception on a failing status code (which is not what real browsers do), and that is on purpose.

Anyhow, you can use webClient.getOptions().setThrowExceptionOnFailingStatusCode(false).

Also, you need to wait long enough; below is an example:

    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        String url = "http://www.osbot.org/";
        HtmlPage htmlPage = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(10_000);
        System.out.println(htmlPage.asText());
    }
– Ahmed Ashour
    I tested it and not working. Still the error is showing: "Checking your browser before accessing ....." "DDoS protection by Cloudflare " – Mohsen Abasi Mar 10 '18 at 13:47
  • This actually works for my test. But it just does not redirect to the final page. All cookies generated are valid and can be used. And by getting the same page again, you will finally reach the redirected page. – eos1d3 Nov 05 '18 at 16:40
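As the comment above notes, the cookies HtmlUnit collects during the challenge are valid and can be reused. A minimal stdlib-only sketch of turning name/value pairs into a single Cookie request header for replaying them through something like HttpURLConnection (the cookie names and values here are hypothetical examples; the real ones come from client.getCookies(url)):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CookieHeader {
    // Join name=value pairs into one Cookie request header value,
    // e.g. "a=1; b=2", suitable for conn.setRequestProperty("Cookie", ...).
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        // Hypothetical values; Cloudflare typically sets cookies such as
        // cf_clearance after the challenge passes.
        cookies.put("__cfduid", "abc123");
        cookies.put("cf_clearance", "xyz789");
        System.out.println(toCookieHeader(cookies));
        // __cfduid=abc123; cf_clearance=xyz789
    }
}
```

Using a LinkedHashMap keeps the cookies in insertion order, which makes the resulting header deterministic.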

You should ask the site owner if they can whitelist your IPs. If you're doing anything like trying to scrape the site, then they may not want you to.

– damoncloudflare
  • Unfortunately this is not an option, is there a bypass to this? I see you seem to work for CloudFlare perhaps so you may not want to answer this, I understand. – SirRan Aug 26 '15 at 20:01
  • From my research there doesn't seem to be some obvious way to do it. I don't know if anyone is going to share basically extremely valuable hacking knowledge for free – Man Person Aug 26 '15 at 20:50
  • When using a regular URLConnection I get a 403 error but when adding user agent it worked and I was able to get all the source. I feel like I have to do the equivalent but not sure how to do it in HtmlUnit's api. – SirRan Aug 26 '15 at 22:18
  • Hmm that's interesting that just adding a user agent did the trick, cool find. I'm not sure either I would post that as a separate question – Man Person Aug 27 '15 at 14:12
  • Should the site owner allow the crawler to crawl the site data – Mohsen Abasi Mar 10 '18 at 13:49
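SirRan's comment above suggests that simply sending a browser-like User-Agent header got a plain URLConnection through. A minimal stdlib sketch of that approach (the User-Agent string and target URL are just examples, and this alone is not guaranteed to pass Cloudflare's JavaScript check):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentFetch {
    // A desktop-browser-like User-Agent string; any realistic value may do,
    // this one is only an example.
    static final String UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Prepare a connection with the User-Agent header set. No network
    // traffic happens until the caller reads from the connection.
    static HttpURLConnection prepare(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", UA);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = prepare("https://www.osbot.org/");
        System.out.println(conn.getRequestProperty("User-Agent"));
        // Call conn.getResponseCode() here to actually perform the request.
    }
}
```

In HtmlUnit the equivalent would be client.addRequestHeader("User-Agent", ...), although constructing the WebClient with BrowserVersion.CHROME already sends a Chrome-like User-Agent by default.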