
I am having trouble extracting information from a website with HtmlUnit. The page contains a table like this:

<table class="result" cellspacing="0" summary="Diese Tabelle zeigt die Ergebnisse Ihrer aktuellen Verbindungssuche">
    <tr>
<th class="station first">Bahnhof/Haltestelle</th>
<th class="date">Datum</th>
<th class="time" colspan="2">Zeit</th>
<th class="duration">Dauer</th>
<th class="changes">Umst.</th>
<th class="products">
Produkte
</th>
<th class="fares" >
Preis f&#252;r alle Reisenden*<a href="http://reiseauskunft.bahn.de/bin/help.exe/dn?ld=15076&amp;seqnr=1&amp;ident=c3.02648276.1441200542&amp;rt=1&amp;rememberSortType=minDeparture&amp;tpl=popup&amp;popupType=infoFares&amp;" target="infoFares" class="pointer" onclick="showLayer(this,'infoLayerFares');Event.stop(event);"><img src="http://www.img-bahn.de/v/1508/img/info_rot_outline_16x16.png" id="fareTooltipImg" style="margin: 0; margin-left: 2px; vertical-align: middle;"  height="16" width="16" alt="" title="" border="0"/></a></th><th class="return ">&nbsp;</th></tr><tr class="links"><td>&nbsp;</td><td class="date">&nbsp;</td><td class="time" colspan="2"><a href="/bin/query.exe/dn?ld=15076&amp;seqnr=1&amp;ident=c3.02648276.1441200542&amp;rt=1&amp;rememberSortType=minDeparture&amp;REQ0HafasScrollDir=2" class="arrowlink arrowlinktop" >Fr&#252;her</a></td><td class="duration">&nbsp;</td>
<td class="changes">&nbsp;</td>
<td class="products">&nbsp;</td>
<td class="fareStd">
Normalpreis
</td>
<td class="return">&nbsp;</td>
</tr>
<tr class=" firstrow">
<td class="station first
pointer" onclick="triggerDispatcher(event,'linkDtlC0-0')">
<a class="block floatLeft iconLink open" id="linkDtlC0-0" href="http://reiseauskunft.bahn.de/bin/query2.exe/dn?ld=15076&amp;seqnr=1&amp;ident=c3.02648276.1441200542&amp;rt=1&amp;rememberSortType=minDeparture&amp;HWAI=CONNECTION$C0-0!id=C0-0!HwaiConId=C0-0!HwaiDetailStatus=details!;~CONNECTION$C0-0!HwaiMapStatus=UNDEFINED!HwaiMapNumber=UNDEFINED!HwaiMapSlider=UNDEFINED!HwaiDetailHimMessage=UNDEFINED!;" rel="" title="Details zur Verbindung anzeigen"></a>
<div class="resultDep">
Arnsberg(Westf)  
</div>
</td>
<td class="date">
Fr, 04.09.15</td>
<td class="timetx">
ab
</td>
<td class="time">
05:28
&nbsp;</td>
<td class="duration lastrow" rowspan="2">
9:19
</td>
<td class="changes lastrow" rowspan="2">5</td>
<td class="products lastrow" rowspan="2">
RE, ERB, WFB, NWB, ME
</td>
<td class=" fareStd lastrow button-inside tablebutton" rowspan="2"><span ><span class="fareOutput">79,90&nbsp;EUR</span>&nbsp;<a href="" target="ext" onclick="popUp('about:blank','ext','cms')"></a></span><br /><span class="button-border "><a href="http://reiseauskunft.bahn.de/bin/query2.exe/dn?ld=15076&amp;seqnr=1&amp;ident=c3.02648276.1441200542&amp;rt=1&amp;rememberSortType=minDeparture&amp;oCID=C0-0&amp;waitForBooking=yes&amp;lang=de&amp;country=DEU&amp;prepareOrder=yes&amp;sTID=C0-0.0@1&amp;hafasSessionExpires=0209151544&amp;zielorth=Flensburg&amp;zielorta=DEU&amp;xcoorda=9436525&amp;ycoorda=54774043&amp;distancea=385&amp;zielortm=Flensburg&amp;services=hma&amp;bcrvglpreis=7990&amp;" title="Buchen Sie in den folgenden Schritten Ihre Fahrkarte/Sitzplatzreservierung f&#252;r Arnsberg(Westf) - Flensburg ab 05:28 "  class="buttonbold"><span>Zur&nbsp;Buchung</span></a></span> </td><td class="return lastrow button-inside tablebutton" rowspan="2">
<a class="arrowlink block returnJourney" href="http://reiseauskunft.bahn.de/bin/query2.exe/dn?ld=15076&amp;seqnr=1&amp;ident=c3.02648276.1441200542&amp;rt=1&amp;rememberSortType=minDeparture&amp;guiVCtrl_connection_detailsOut_select_C0-0=yes&amp;selectOutwardJourney=yes&amp;selectReturnMode=yes&amp;guiVCtrl_connection_detailsOut_add_selection=yes&amp;" title="F&#252;gen Sie zu dieser Hinfahrt eine R&#252;ckfahrt hinzu." >R&#252;ckfahrt hinzuf&#252;gen</a>
</td>
</tr>
<tr class=" last">
<td class="station stationDest pointer" onclick="triggerDispatcher(event,'linkDtlC0-0')">
Flensburg 
</td>
<td class="date">Fr, 04.09.15</td>
<td class="timetx">an</td>
<td class="time">
14:47
</td>
</tr>
<tr id="trC0-0" class="noHeight details">
<td colspan="9" >
<div id="updateC0-0"></div>
<div id="moreC0-0" class="moreDetailContainer">
</div>
</td>
</tr>
</table>    

To extract the individual HtmlTableDataCell objects, I wrote the following code snippet:

final HtmlTable table = (HtmlTable) page.getFirstByXPath("//table[@class='result']");
List<HtmlTableRow> zeilen = table.getRows();
int i = 2;
// Idea: the row index i starts at 2 and must stay below zeilen.size() - 2, because the last row
// of the table holds no useful information and two consecutive rows form one information unit:
// zeilen.size() - 1 (useless last row) - 1 (second row of the information unit).
while (zeilen.size() > 2 && i < zeilen.size() - 2)
{
    // first row of the unit: departure station, date, time, duration, changes, trains
    HtmlDivision resultDep = zeilen.get(i).getFirstByXPath(".//td/div[@class='resultDep']");
    HtmlTableDataCell startdatum = zeilen.get(i).getFirstByXPath(".//td[@class='date']");
    HtmlTableDataCell startzeit  = zeilen.get(i).getFirstByXPath(".//td[@class='time']");
    HtmlTableDataCell dauer      = zeilen.get(i).getFirstByXPath(".//td[@class='duration lastrow']");
    HtmlTableDataCell umstiege   = zeilen.get(i).getFirstByXPath(".//td[@class='changes lastrow']");
    HtmlTableDataCell zuege      = zeilen.get(i).getFirstByXPath(".//td[@class='products lastrow']");
    // fetch the information from the second row
    // the final destination station
    HtmlTableDataCell stationDest = zeilen.get(i+1).getFirstByXPath(".//td[@class='station stationDest pointer']");
    HtmlTableDataCell enddatum = zeilen.get(i+1).getFirstByXPath(".//td[@class='date']");
    HtmlTableDataCell endzeit  = zeilen.get(i+1).getFirstByXPath(".//td[@class='time']");
    // Note: this builds a list of prices: the first entry contains the saver fare (Sparpreis), the second entry
    List<HtmlSpan> preise = (List<HtmlSpan>) zeilen.get(i+1).getByXPath(".//span[@class='fareOutput']");
    // (the increment of i is omitted here for brevity)
}

Essentially, I get the table and its rows. For every row I try to extract several pieces of information, each stored either directly in an HtmlTableDataCell or in a div inside an HtmlTableDataCell. To do that I call zeilen.get(i).getFirstByXPath(XPath) (getFirstByXPath on row no. i) with a relative XPath to address these DOM objects. But when I later read the objects, for example with resultDep.getTextContent(), I just get a java.lang.NullPointerException. So it seems that my relative XPath is broken and I cannot address the objects inside a row correctly.
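To make the relative-XPath idea concrete, here is a minimal sketch. It does not use HtmlUnit; it uses the XPath engine that ships with the JDK (javax.xml.xpath) on a simplified, well-formed stand-in for two rows of the table, so the class name RelativeXPathDemo, the SAMPLE string, and the helper methods are all assumptions made for illustration. It shows that an expression starting with ".//" is evaluated against the row it is called on, that an expression starting with "//" searches the whole document regardless of the row, and that a lookup with no match yields null, which is what later turns into a NullPointerException when the result is dereferenced.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    // Hypothetical, simplified stand-in for one two-row unit of the result table.
    static final String SAMPLE =
          "<table class='result'>"
        + "<tr class=' firstrow'>"
        + "<td class='station first pointer'><div class='resultDep'> Arnsberg(Westf) </div></td>"
        + "<td class='date'>Fr, 04.09.15</td>"
        + "<td class='time'>05:28</td>"
        + "</tr>"
        + "<tr class=' last'>"
        + "<td class='station stationDest pointer'>Flensburg</td>"
        + "<td class='date'>Fr, 04.09.15</td>"
        + "<td class='time'>14:47</td>"
        + "</tr>"
        + "</table>";

    static final XPath XPATH = XPathFactory.newInstance().newXPath();

    // Parses a well-formed XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the row with the given index inside the result table.
    static Node row(Document doc, int index) {
        try {
            NodeList rows = (NodeList) XPATH.evaluate(
                    "//table[@class='result']//tr", doc, XPathConstants.NODESET);
            return rows.item(index);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Trimmed text of the first node matching expr for the given context node,
    // or null if nothing matches.
    static String textOrNull(Node context, String expr) {
        try {
            Node n = (Node) XPATH.evaluate(expr, context, XPathConstants.NODE);
            return n == null ? null : n.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node first = row(doc, 0);
        Node second = row(doc, 1);
        // Relative: only descendants of the given row are searched.
        System.out.println(textOrNull(first, ".//td/div[@class='resultDep']"));
        System.out.println(textOrNull(second, ".//td[@class='time']"));
        // Absolute: the whole document is searched, even though a row is passed in.
        System.out.println(textOrNull(second, "//td[@class='time']"));
        // No match in this row: null, so the result must be checked before use.
        System.out.println(textOrNull(second, ".//td/div[@class='resultDep']"));
    }
}
```

If the same expressions return null in HtmlUnit, checking each result for null before calling getTextContent() and logging the row index should at least show which row and which expression fail.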

How can I use a relative XPath to solve this?

I thank you all for your responses :-)

CU Sebastian

  • Your code doesn't increment `i`. And it passes with latest [build](https://ci.canoo.com/teamcity/viewLog.html?buildTypeId=HtmlUnit_FastBuild&buildId=lastSuccessful&tab=artifacts), which version do you use? – Ahmed Ashour Sep 03 '15 at 12:37

1 Answer


Thanks for your answer. I do increment "i", but I omitted that for clarity and brevity. The while loop runs through all the rows. The syntax of the code is OK, but I am unsure whether ".//td/div" or ".//td" are the correct relative XPaths to get the DOM objects inside the rows. Has anybody ever tried something like this?

Thx,

Sebastian
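One thing that may be worth ruling out, independent of the ".//" prefix: a predicate such as [@class='time'] compares the whole attribute string, so it only matches when the class attribute is exactly that text. Several cells in the table carry more than one class or a leading space (for example class=" fareStd lastrow button-inside tablebutton"). The sketch below, again using the JDK's javax.xml.xpath rather than HtmlUnit and a made-up one-row sample, compares an exact match with a token-based match. The names ClassTokenXPath, hasClass, and count are hypothetical helpers, and whether this is related to the NullPointerException in the question is not known.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ClassTokenXPath {
    // Hypothetical one-row sample with multi-valued class attributes.
    static final String SAMPLE =
          "<tr>"
        + "<td class='duration lastrow'>9:19</td>"
        + "<td class=' fareStd lastrow button-inside tablebutton'>"
        + "<span class='fareOutput'>79,90 EUR</span>"
        + "</td>"
        + "</tr>";

    static final XPath XPATH = XPathFactory.newInstance().newXPath();

    // XPath 1.0 predicate that is true when @class contains the given
    // whitespace-separated token (not merely the substring).
    static String hasClass(String token) {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";
    }

    // Number of nodes selected by expr in the sample document.
    static int count(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            Double n = (Double) XPATH.evaluate(
                    "count(" + expr + ")", doc.getDocumentElement(), XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Exact comparison against the full attribute value.
        System.out.println(count(".//td[@class='fareStd']"));
        // Token comparison: ignores other classes and surrounding whitespace.
        System.out.println(count(".//td[" + hasClass("fareStd") + "]"));
        System.out.println(count(".//td[" + hasClass("lastrow") + "]"));
    }
}
```

The predicate is plain XPath 1.0, so the same expression string should be usable with getFirstByXPath and getByXPath as well.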