I am having trouble retrieving the list of DOM elements using the getElementsByName method of HtmlPage.

Here is the relevant part of the HTML page. I am trying to get the select element named CategoriaAgente.

HTML (The part that I need):

<select name="CategoriaAgente">
  <option value="-">Escolha uma categoria</option>
  <option value="t">Todos</option>
  <option value="p">Permissionária de Distribuição</option>
  <option value="d">Concessionária de Distribuição</option>
</select>

Snippet of the Java code (Using HtmlUnit):

    public List<HtmlOption> listaAgentes() {
        List<HtmlOption> listaAgentes = null;

        try (WebClient webClient = new WebClient()) {
            log.info("COLETANDO AGENTES"); // "collecting agents"

            // WebClient parameters
            webClient.setJavaScriptTimeout(15000);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setTimeout(300000);

            String url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
            HtmlPage page = webClient.getPage(url);

            // Select the agent category
            List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");
            // ...

The list listaCategoriaAgente is ALWAYS empty. I tried several solutions found on Stack Overflow, but none of them worked. Any help? Thanks in advance!

EDIT: After the comment from @hooknc, I found that the page is serving some kind of CAPTCHA challenge from Cloudflare. This is what I get from Postman:

[Screenshot of the Postman response]

Does anyone know how to get past this challenge-form using HtmlUnit? Thanks!
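Before trying anything else, it may help to confirm in code that the response really is the challenge page rather than the expected form. Below is a minimal sketch in plain Java, not an HtmlUnit API: `ChallengePageCheck` and `looksLikeChallenge` are hypothetical names, and the two marker strings are simply the ones seen in this thread, so they may stop matching if Cloudflare changes its page. It only detects the interstitial; it does not get past it.

```java
import java.util.List;

/** Heuristic check for the Cloudflare interstitial discussed in this thread. */
final class ChallengePageCheck {

    // Assumption: these markers are the ones observed in this thread, not an official list.
    private static final List<String> MARKERS = List.of(
            "Checking if the site connection is secure",
            "challenge-form");

    /** Returns true when the markup contains any of the known challenge markers. */
    static boolean looksLikeChallenge(String markup) {
        if (markup == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (markup.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hand-written sample markup, used only to exercise the check.
        String challenge = "<form id=\"challenge-form\"></form>";
        String expected = "<select name=\"CategoriaAgente\"></select>";
        System.out.println(looksLikeChallenge(challenge));
        System.out.println(looksLikeChallenge(expected));
    }
}
```

With HtmlUnit, the string passed in could be the result of `page.asXml()`, as in the snippets in this post.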

EDIT 2:

Well, I think I made some progress (?).

This is the code so far:

    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.getOptions().setCssEnabled(false);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getCache().setMaxSize(0);
        webClient.waitForBackgroundJavaScript(10_000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);

        HtmlPage page = null;
        String url = null;

        url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
        page = webClient.getPage(url);

        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());

            synchronized(page) {
                page.wait(10_000);
            }
            webClient.waitForBackgroundJavaScript(10_000);
        }

And this is what I get from the log:

<div id="challenge-success" style="display: none;">
      <div class="h2">
        <span class="icon-wrapper">
          <img class="heading-icon" alt="Success icon" src=""/>
        </span>
        Connection is secure
      </div>
      <div class="core-msg spacer">
        Proceeding...
      </div>
    </div>

So it says "Proceeding...", but nothing happens. I waited a very long time, but it just stays stuck on "Proceeding...".

Any thoughts? Thanks!
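One thing that could be tried in place of a single fixed wait is polling the markup until the challenge text disappears or a deadline passes. The sketch below is plain Java and rests on assumptions: `ChallengeWait` and `waitUntilGone` are hypothetical names, and the caller supplies the current markup through a `Supplier<String>`. It does not solve the challenge itself; it only avoids waiting blindly and reports whether the page ever moved on.

```java
import java.util.function.Supplier;

/** Polls a markup source until a marker string is gone or a timeout expires. */
final class ChallengeWait {

    /**
     * Returns true as soon as the supplied markup no longer contains the marker,
     * false if the timeout expires first or the thread is interrupted.
     */
    static boolean waitUntilGone(Supplier<String> markupSource, String marker,
                                 long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            String markup = markupSource.get();
            if (markup != null && !markup.contains(marker)) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        String marker = "Checking if the site connection is secure";
        int[] calls = {0};
        // Simulated source: returns the challenge text twice, then the expected content.
        Supplier<String> fake =
                () -> ++calls[0] < 3 ? marker : "<select name=\"CategoriaAgente\"/>";
        System.out.println(waitUntilGone(fake, marker, 2_000, 10));
    }
}
```

With HtmlUnit, the supplier would probably need to re-read the page from the current window on every call, for example `() -> ((HtmlPage) webClient.getCurrentWindow().getEnclosedPage()).asXml()` (assuming the enclosed page is an `HtmlPage`), because the `page` variable keeps pointing at the originally loaded page even if the window navigates.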

gbossa
  • I have been testing the www2 version of your website and it seems that you're being hosted by cloudflare. The first page that comes back seems to be testing your browser and I am unsure of how to get around that check. Perhaps there is something you could do to turn off that security check at cloudflare? – hooknc Nov 29 '22 at 21:38
  • The google term to use is cloudflare Checking if the site connection is secure. Hope that helps. – hooknc Nov 29 '22 at 21:39
  • Thanks for the tip. It seems to be the right direction. But so far I have not found a solution to bypass the Cloudflare security check. The odd thing is: it worked before. It suddenly stopped working.... – gbossa Nov 30 '22 at 16:08
  • I found that the page has a `challenge-form`. Possibly a captcha. Please, do you know how I can bypass it? Thanks! – gbossa Nov 30 '22 at 17:07
  • I have zero idea. I did a google search on 'cloudflare Checking if the site connection is secure' and the best answer I could find is this one: [Cloudflare Checking if the site connection is secure](https://community.cloudflare.com/t/cloudflare-checking-if-the-site-connection-is-secure/419107/20). I would suggest trying to contact the admin folks that manage that site for your company and ask them to disable that check and/or add your company's ip address block to a safe list. I am unsure if this helps or not, you are not the only one to be affected by this issue. Good luck. – hooknc Nov 30 '22 at 17:22
  • Well, I am in trouble... The website I am trying to access belongs to the government. But I will keep searching. When I have a solution I will post it. Thanks anyway, bro. – gbossa Nov 30 '22 at 20:39
  • It worked. What I did was update the version of HtmlUnit (I had to update apache-commons-lang3 too) and it worked. A guy from HtmlUnit read another of my questions and released an updated version to solve the cookie problem. And voilà, it worked. I will prepare a proper answer. – gbossa Dec 15 '22 at 13:24

1 Answer

Well, this is what happened. I posted a related question, and someone (possibly from the HtmlUnit team) pushed an update to Git that fixed the cookie problem. With that updated version (2.68.0-SNAPSHOT; I also had to update apache-commons-lang3), all the problems disappeared. Cloudflare accepted the connection and everything worked! Here is the final version of the code:

    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        String url = "https://www2.aneel.gov.br:443/aplicacoes_liferay/tarifa/";

        // WebClient parameters
        webClient.getOptions().setCssEnabled(true);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);

        // Fresh cookie manager with cookies enabled
        CookieManager cookies = new CookieManager();
        cookies.setCookiesEnabled(true);
        webClient.setCookieManager(cookies);

        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        webClient.waitForBackgroundJavaScript(10000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10000);

        // Disable caching
        webClient.getCache().setMaxSize(0);

        // Silence HtmlUnit and HttpClient logging
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(java.util.logging.Level.OFF);

        HtmlPage page = webClient.getPage(url);
        webClient.getRefreshHandler().handleRefresh(page, new URL(url), 10);

        // Give the challenge page time to finish
        synchronized(page) {
            page.wait(10000);
        }

        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());
            webClient.waitForBackgroundJavaScript(10_000);
        }

        List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");

With the updated libraries and this code, the list of DOM elements I needed came back correctly. Thank you all for the help!

gbossa