How do I preserve line breaks when extracting text out of an HtmlElement

Question

I am trying to extract three separate strings from: https://taxtest.navajocountyaz.gov/Pages/WebForm1.aspx?p=1&apn=103-03-122

The owners' names: Johnson Tommy A & Nell H Cprs
The owners' street address: 133 Maricopa Dr
The owners' city, state, and ZIP code, as one string: Winslow AZ 86047-2013

I tried the following code:

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.javascript.*;
import java.io.*;

public class PropertyOwner {

    public static void PropertyOwner () {

        try (final WebClient webClient = new WebClient()) {
            System.getProperties().put("org.apache.commons.logging.simplelog.defaultlog", "fatal");
            java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);

            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            webClient.getOptions().setCssEnabled(false);
            webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
            webClient.setCssErrorHandler(new SilentCssErrorHandler());
            HtmlPage page = webClient.getPage("http://taxtest.navajocountyaz.gov/Pages/WebForm1.aspx?p=1&apn=103-03-122");
            webClient.waitForBackgroundJavaScriptStartingBefore(10000);     
            page = (HtmlPage) page.getEnclosingWindow().getEnclosedPage();
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
            HtmlTable pnlGridView_nextYear = (HtmlTable) page.getElementById("pnlGridView_nextYear");
            HtmlTableDataCell ownershipCell = (HtmlTableDataCell) pnlGridView_nextYear.getCellAt(0,0);
            String ownershipCellAsText = ownershipCell.toString();
            HtmlElement onwershipElement = (HtmlElement) page.getElementById("lblOwnership_NextYear");
            System.out.println("ownershipCellAsText = " + ownershipCellAsText);
            System.out.println("onwershipElement.getTextContent() = " + onwershipElement.getTextContent());


        }

        catch (Exception e) {
            System.out.println("Error: "+ e);
        }
  
    }
  
    public static void main(String[] args) {
        File file = new File("validParcelIDs.txt");
        PropertyOwner();
    }

}

I then used the following two commands:

> javac -classpath ".:/opt/htmlunit_2.69.0/*"  PropertyOwner.java
> java -classpath ".:/opt/htmlunit_2.69.0/*"  PropertyOwner

And got the following output:

ownershipCellAsText = HtmlTableDataCell[<td style="border:solid 1px black;">]
onwershipElement.getTextContent() = Johnson Tommy A & Nell H Cprs133 Maricopa DrWinslow AZ 86047-2013

As you can see, onwershipElement.getTextContent() is fairly close to what I want, except that it removed the line breaks from the HtmlElement.

I tried the solution proposed over 8 years ago in "Java getting text content from an element to include line breaks" by adding just three lines of code to my program. The three (non-consecutive) lines were:

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
.....
WebView webView = new WebView();

That gave me the following compilation error:

achab@HP-Envy [Navajo] $javac -classpath ".:/opt/htmlunit_2.69.0/*"  PropertyOwner.java 
PropertyOwner.java:15: error: cannot find symbol
            WebView webView = new WebView(); 
            ^
  symbol:   class WebView
  location: class PropertyOwner
PropertyOwner.java:15: error: cannot find symbol
            WebView webView = new WebView(); 
                                  ^
  symbol:   class WebView
  location: class PropertyOwner
2 errors

So, it seems like that solution is outdated. HtmlUnit 2.69.0 was released on January 5, 2023.

Before that, I had tried the 2.47.1 release of HtmlUnit, which was released about two years ago, with the same two problems described above: failure to preserve line breaks in the first version of the code, and not finding the symbol WebView in the second version of the code.

What do I need to change in order to get the three separate strings that I want?

score 1 · Answer 1 · answered Jan 11 '23 at 23:41

Instead of onwershipElement.getTextContent(), use onwershipElement.asNormalizedText().
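
To get from that single string to the three separate strings asked for, one option is to split the text on line terminators. The following is only a sketch, assuming the label separates its three lines with <br> elements and that asNormalizedText() renders each of them as a line break. The class name OwnerLines and the helper splitLines are hypothetical names chosen for illustration, and the hard-coded string stands in for the value returned by onwershipElement.asNormalizedText() so that the example runs without HtmlUnit on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of HtmlUnit.
class OwnerLines {

    // Splits text on any line terminator (\R matches \n, \r\n, and others),
    // trims each piece, and drops empty lines.
    static List<String> splitLines(String normalizedText) {
        return Arrays.stream(normalizedText.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: stand-in for onwershipElement.asNormalizedText().
        String text = "Johnson Tommy A & Nell H Cprs\n133 Maricopa Dr\nWinslow AZ 86047-2013";

        List<String> lines = splitLines(text);
        if (lines.size() != 3) {
            throw new IllegalStateException("Expected 3 lines but got " + lines.size());
        }

        String ownerNames = lines.get(0);
        String streetAddress = lines.get(1);
        String cityStateZip = lines.get(2);

        System.out.println("ownerNames = " + ownerNames);
        System.out.println("streetAddress = " + streetAddress);
        System.out.println("cityStateZip = " + cityStateZip);
    }
}
```

In the original program, the hard-coded string would be replaced by the call on the element. The size check is there because other parcels may not have exactly three lines in that cell.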

Eritrean