I'm having trouble extracting information from a website with HtmlUnit. The page contains a table like this:
<table class="result" cellspacing="0" summary="Diese Tabelle zeigt die Ergebnisse Ihrer aktuellen Verbindungssuche">
<tr>
<th class="station first">Bahnhof/Haltestelle</th>
<th class="date">Datum</th>
<th class="time" colspan="2">Zeit</th>
<th class="duration">Dauer</th>
<th class="changes">Umst.</th>
<th class="products">
Produkte
</th>
<th class="fares" >
Preis für alle Reisenden*<a href="http://reiseauskunft.bahn.de/bin/help.exe/dn?ld=15076&seqnr=1&ident=c3.02648276.1441200542&rt=1&rememberSortType=minDeparture&tpl=popup&popupType=infoFares&" target="infoFares" class="pointer" onclick="showLayer(this,'infoLayerFares');Event.stop(event);"><img src="http://www.img-bahn.de/v/1508/img/info_rot_outline_16x16.png" id="fareTooltipImg" style="margin: 0; margin-left: 2px; vertical-align: middle;" height="16" width="16" alt="" title="" border="0"/></a></th><th class="return "> </th></tr><tr class="links"><td> </td><td class="date"> </td><td class="time" colspan="2"><a href="/bin/query.exe/dn?ld=15076&seqnr=1&ident=c3.02648276.1441200542&rt=1&rememberSortType=minDeparture&REQ0HafasScrollDir=2" class="arrowlink arrowlinktop" >Früher</a></td><td class="duration"> </td>
<td class="changes"> </td>
<td class="products"> </td>
<td class="fareStd">
Normalpreis
</td>
<td class="return"> </td>
</tr>
<tr class=" firstrow">
<td class="station first
pointer" onclick="triggerDispatcher(event,'linkDtlC0-0')">
<a class="block floatLeft iconLink open" id="linkDtlC0-0" href="http://reiseauskunft.bahn.de/bin/query2.exe/dn?ld=15076&seqnr=1&ident=c3.02648276.1441200542&rt=1&rememberSortType=minDeparture&HWAI=CONNECTION$C0-0!id=C0-0!HwaiConId=C0-0!HwaiDetailStatus=details!;~CONNECTION$C0-0!HwaiMapStatus=UNDEFINED!HwaiMapNumber=UNDEFINED!HwaiMapSlider=UNDEFINED!HwaiDetailHimMessage=UNDEFINED!;" rel="" title="Details zur Verbindung anzeigen"></a>
<div class="resultDep">
Arnsberg(Westf)
</div>
</td>
<td class="date">
Fr, 04.09.15</td>
<td class="timetx">
ab
</td>
<td class="time">
05:28
</td>
<td class="duration lastrow" rowspan="2">
9:19
</td>
<td class="changes lastrow" rowspan="2">5</td>
<td class="products lastrow" rowspan="2">
RE, ERB, WFB, NWB, ME
</td>
<td class=" fareStd lastrow button-inside tablebutton" rowspan="2"><span ><span class="fareOutput">79,90 EUR</span> <a href="" target="ext" onclick="popUp('about:blank','ext','cms')"></a></span><br /><span class="button-border "><a href="http://reiseauskunft.bahn.de/bin/query2.exe/dn?ld=15076&seqnr=1&ident=c3.02648276.1441200542&rt=1&rememberSortType=minDeparture&oCID=C0-0&waitForBooking=yes&lang=de&country=DEU&prepareOrder=yes&sTID=C0-0.0@1&hafasSessionExpires=0209151544&zielorth=Flensburg&zielorta=DEU&xcoorda=9436525&ycoorda=54774043&distancea=385&zielortm=Flensburg&services=hma&bcrvglpreis=7990&" title="Buchen Sie in den folgenden Schritten Ihre Fahrkarte/Sitzplatzreservierung für Arnsberg(Westf) - Flensburg ab 05:28 " class="buttonbold"><span>Zur Buchung</span></a></span> </td><td class="return lastrow button-inside tablebutton" rowspan="2">
<a class="arrowlink block returnJourney" href="http://reiseauskunft.bahn.de/bin/query2.exe/dn?ld=15076&seqnr=1&ident=c3.02648276.1441200542&rt=1&rememberSortType=minDeparture&guiVCtrl_connection_detailsOut_select_C0-0=yes&selectOutwardJourney=yes&selectReturnMode=yes&guiVCtrl_connection_detailsOut_add_selection=yes&" title="Fügen Sie zu dieser Hinfahrt eine Rückfahrt hinzu." >Rückfahrt hinzufügen</a>
</td>
</tr>
<tr class=" last">
<td class="station stationDest pointer" onclick="triggerDispatcher(event,'linkDtlC0-0')">
Flensburg
</td>
<td class="date">Fr, 04.09.15</td>
<td class="timetx">an</td>
<td class="time">
14:47
</td>
</tr>
<tr id="trC0-0" class="noHeight details">
<td colspan="9" >
<div id="updateC0-0"></div>
<div id="moreC0-0" class="moreDetailContainer">
</div>
</td>
</tr>
</table>
To extract the individual HtmlTableDataCell elements, I have written the following code snippet:
final HtmlTable table = (HtmlTable) page.getFirstByXPath("//table[@class='result']");
List<HtmlTableRow> zeilen = table.getRows();
int i = 2;
//Idea: the row index i starts at 2 and must stay below zeilen.size()-2: the last row of the table holds no useful information, and two rows together form one information unit, hence zeilen.size() - 1 (useless table row) - 1 (second row of the information unit)
while(zeilen.size() > 2 && i < zeilen.size()-2)
{
HtmlDivision resultDep = zeilen.get(i).getFirstByXPath(".//td/div[@class='resultDep']");
HtmlTableDataCell startdatum = zeilen.get(i).getFirstByXPath(".//td[@class='date']");
HtmlTableDataCell startzeit = zeilen.get(i).getFirstByXPath(".//td[@class='time']");
HtmlTableDataCell dauer = zeilen.get(i).getFirstByXPath(".//td[@class='duration lastrow']");
HtmlTableDataCell umstiege = zeilen.get(i).getFirstByXPath(".//td[@class='changes lastrow']");
HtmlTableDataCell zuege = zeilen.get(i).getFirstByXPath(".//td[@class='products lastrow']");
//get the information from the second row
//the final destination station
HtmlTableDataCell stationDest = zeilen.get(i+1).getFirstByXPath(".//td[@class='station stationDest pointer']");
HtmlTableDataCell enddatum = zeilen.get(i+1).getFirstByXPath(".//td[@class='date']");
HtmlTableDataCell endzeit = zeilen.get(i+1).getFirstByXPath(".//td[@class='time']");
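//In the sample table the fareOutput span sits in the first row of the unit (its cell has rowspan="2"), so row i is queried here instead of row i+1
//Note: a list of prices is built here: the first entry contains the Sparpreis, the second entry
List<HtmlSpan> preise = (List<HtmlSpan>) zeilen.get(i).getByXPath(".//span[@class='fareOutput']");
//Advance to the next information unit; without this step the loop never terminates
i += 2;
}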
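The index arithmetic described in the loop comment can be checked in isolation, without HtmlUnit. The following is only a minimal sketch: `RowPairingSketch` and `unitIndexes` are hypothetical names, and the step of two rows per information unit is an assumption taken from the comment (the sample table contains a single connection, so the real page may need a different step).

```java
import java.util.ArrayList;
import java.util.List;

public class RowPairingSketch {

    // Hypothetical helper: returns {firstRow, secondRow} index pairs for a table
    // with rowCount rows, assuming rows 0 and 1 and the last row carry no data.
    static List<int[]> unitIndexes(int rowCount) {
        List<int[]> pairs = new ArrayList<>();
        // Same bound as in the loop comment; i + 1 is always a valid index
        // because i < rowCount - 2. The step of 2 is an assumption.
        for (int i = 2; i < rowCount - 2; i += 2) {
            pairs.add(new int[] {i, i + 1});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The sample table has 5 rows: header, links, first row, last row, details.
        for (int[] pair : unitIndexes(5)) {
            System.out.println(pair[0] + " + " + pair[1]); // prints "2 + 3"
        }
    }
}
```

With five rows this yields exactly one pair, rows 2 and 3, which matches the single connection in the sample table.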
Essentially, I get the table and its rows. For each row I try to extract several pieces of information, which are held either directly in an HtmlTableDataCell or in a div inside an HtmlTableDataCell. To do that I call zeilen.get(i).getFirstByXPath(XPath) (getFirstByXPath on row no. i) with a relative XPath to address these DOM objects. But when I later try to read the objects, for example with resultDep.getTextContent(), I just get a java.lang.NullPointerException. So it seems that my relative XPath is broken and I can't correctly address the objects inside a row.
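To illustrate the relative-XPath idea described above, here is a minimal sketch that deliberately does not use HtmlUnit: it runs the JDK's own `javax.xml.xpath` API against a simplified, well-formed stand-in for the two data rows. `SAMPLE`, `hasClass`, `textOrNull` and `cellText` are hypothetical names introduced only for this example. It shows two things: a leading `.` keeps the search inside the context row, and matching a class token via `contains(concat(' ', normalize-space(@class), ' '), ...)` does not depend on the exact attribute string. A lookup that finds nothing yields `null`, which would explain a NullPointerException on a later `getTextContent()` call.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathSketch {

    // Simplified, well-formed stand-in for the two data rows of the table above.
    static final String SAMPLE =
          "<table class='result'>"
        + "<tr class=' firstrow'>"
        + "<td class='station first pointer'><div class='resultDep'> Arnsberg(Westf) </div></td>"
        + "<td class='date'> Fr, 04.09.15</td>"
        + "<td class='timetx'> ab </td>"
        + "<td class='time'> 05:28 </td>"
        + "<td class='duration lastrow' rowspan='2'> 9:19 </td>"
        + "<td class=' fareStd lastrow button-inside tablebutton' rowspan='2'>"
        + "<span class='fareOutput'>79,90 EUR</span></td>"
        + "</tr>"
        + "<tr class=' last'>"
        + "<td class='station stationDest pointer'> Flensburg </td>"
        + "<td class='date'>Fr, 04.09.15</td>"
        + "<td class='timetx'>an</td>"
        + "<td class='time'> 14:47 </td>"
        + "</tr>"
        + "</table>";

    // Predicate that matches one class token, whatever other classes or whitespace surround it.
    static String hasClass(String token) {
        return "[contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')]";
    }

    // Evaluates expr relative to context; returns the trimmed text, or null if nothing matches.
    static String textOrNull(Node context, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node hit = (Node) xpath.evaluate(expr, context, XPathConstants.NODE);
            return hit == null ? null : hit.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the text of the first cell carrying classToken inside the rowIndex-th <tr>.
    static String cellText(String xml, int rowIndex, String classToken) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList rows = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//tr", doc, XPathConstants.NODESET);
            // The leading "." restricts the search to this row; a bare "//td" would scan the whole document.
            return textOrNull(rows.item(rowIndex), ".//td" + hasClass(classToken));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cellText(SAMPLE, 0, "time"));     // 05:28
        System.out.println(cellText(SAMPLE, 1, "time"));     // 14:47
        System.out.println(cellText(SAMPLE, 0, "duration")); // 9:19
        System.out.println(cellText(SAMPLE, 1, "duration")); // null: the second row has no such cell
    }
}
```

The XPath expressions themselves are plain XPath 1.0, so they should carry over to `getFirstByXPath`, but that has not been verified against the live page, whose DOM may differ from the HTML shown here.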
How can I use a relative XPath to address the elements inside a row correctly?
Thank you all for your responses :-)
CU Sebastian