I am trying to extract three separate strings from: https://taxtest.navajocountyaz.gov/Pages/WebForm1.aspx?p=1&apn=103-03-122
- The owners' names: Johnson Tommy A & Nell H Cprs
- The owners' street address: 133 Maricopa Dr
- The owners' city, state, and zip code, as one string: Winslow AZ 86047-2013
I tried the following code:
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.javascript.*;
import java.io.*;
public class PropertyOwner {
    public static void PropertyOwner () {
        try (final WebClient webClient = new WebClient()) {
            System.getProperties().put("org.apache.commons.logging.simplelog.defaultlog", "fatal");
            java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setCssEnabled(false);
            webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
            webClient.setCssErrorHandler(new SilentCssErrorHandler());
            HtmlPage page = webClient.getPage("http://taxtest.navajocountyaz.gov/Pages/WebForm1.aspx?p=1&apn=103-03-122");
            webClient.waitForBackgroundJavaScriptStartingBefore(10000);
            page = (HtmlPage) page.getEnclosingWindow().getEnclosedPage();
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
            HtmlTable pnlGridView_nextYear = (HtmlTable) page.getElementById("pnlGridView_nextYear");
            HtmlTableDataCell ownershipCell = (HtmlTableDataCell) pnlGridView_nextYear.getCellAt(0,0);
            String ownershipCellAsText = ownershipCell.toString();
            HtmlElement onwershipElement = (HtmlElement) page.getElementById("lblOwnership_NextYear");
            System.out.println("ownershipCellAsText = " + ownershipCellAsText);
            System.out.println("onwershipElement.getTextContent() = " + onwershipElement.getTextContent());
        }
        catch (Exception e) {
            System.out.println("Error: "+ e);
        }
    }
    public static void main(String[] args) {
        File file = new File("validParcelIDs.txt");
        PropertyOwner();
    }
}
I then used the following two commands:
> javac -classpath ".:/opt/htmlunit_2.69.0/*" PropertyOwner.java
> java -classpath ".:/opt/htmlunit_2.69.0/*" PropertyOwner
And got the following output:
ownershipCellAsText = HtmlTableDataCell[<td style="border:solid 1px black;">]
onwershipElement.getTextContent() = Johnson Tommy A & Nell H Cprs133 Maricopa DrWinslow AZ 86047-2013
As you can see, onwershipElement.getTextContent() is fairly close to what I want, except that it removed the line breaks from the HtmlElement, so the three strings run together.
I tried the solution proposed over 8 years ago in "Java getting text content from an element to include line breaks", adding just three (non-consecutive) lines of code to my program:
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
.....
WebView webView = new WebView();
And that gave me the following compiling error:
achab@HP-Envy [Navajo] $javac -classpath ".:/opt/htmlunit_2.69.0/*" PropertyOwner.java
PropertyOwner.java:15: error: cannot find symbol
        WebView webView = new WebView();
        ^
  symbol:   class WebView
  location: class PropertyOwner
PropertyOwner.java:15: error: cannot find symbol
        WebView webView = new WebView();
                              ^
  symbol:   class WebView
  location: class PropertyOwner
2 errors
So it seems that solution is outdated. I am using HtmlUnit 2.69.0, which was released on January 5, 2023.
Before that, I had tried HtmlUnit 2.47.1, which was released about two years ago, and ran into the same two problems described above: the first version of the code fails to preserve line breaks, and the second version cannot find the symbol WebView.
What do I need to change in order to get the three separate strings that I want?
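For reference, the kind of splitting I am after can be sketched without HtmlUnit, using only the JDK's org.w3c.dom classes. This is a sketch under an assumption I have not confirmed against the live page: that the lblOwnership_NextYear element holds three pieces of text separated by <br> elements. The sample markup, the class name BrSplitSketch, and the helpers linesOf and parseFragment are made up for illustration and are not part of HtmlUnit.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BrSplitSketch {

    // Walks the direct children of `parent` and starts a new entry at every <br>.
    // Assumption: the lines are separated by <br> elements, not by other markup.
    public static List<String> linesOf(Node parent) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            boolean isBreak = child.getNodeType() == Node.ELEMENT_NODE
                    && "br".equalsIgnoreCase(child.getNodeName());
            if (isBreak) {
                flush(current, lines);
            } else {
                current.append(child.getTextContent());
            }
        }
        flush(current, lines); // text after the last <br>
        return lines;
    }

    // Adds the trimmed buffer to `lines` if it is not empty, then clears the buffer.
    private static void flush(StringBuilder current, List<String> lines) {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            lines.add(text);
        }
        current.setLength(0);
    }

    // Hypothetical helper: parses a small well-formed fragment into a DOM element,
    // standing in for the element that page.getElementById(...) would return.
    public static Element parseFragment(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // Made-up markup that mimics what I assume the page contains.
        Element span = parseFragment(
                "<span id=\"lblOwnership_NextYear\">Johnson Tommy A &amp; Nell H Cprs<br/>"
                + "133 Maricopa Dr<br/>Winslow AZ 86047-2013</span>");
        for (String line : linesOf(span)) {
            System.out.println(line);
        }
    }
}
```

Since HtmlUnit's element classes implement the org.w3c.dom interfaces, I would expect linesOf(onwershipElement) to work the same way in my program, provided the <br> assumption holds for the real page.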