Stop JTidy parsing if an element is found

Question

Is there any way to download and parse an (X)HTML document only until an element matched by an XPathExpression is found? I'm using Java:

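import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

url = new URL("http://registroapps.uniandes.edu.co/scripts/adm_con_horario1_joomla.php?depto="+params[0]);
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setXHTML(true);
tidy.setShowWarnings(false);
// parseDOM reads the whole stream before returning the DOM tree
Document doc = tidy.parseDOM(url.openStream(), System.out);

// Use XPath to obtain whatever you want from the (X)HTML
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//tr[td[normalize-space(font) = '"+params[1]+"']]/td/font/text()");
NodeList result = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);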

I'm getting the text from HTML documents like this one:

<table width="575" border="0" cellspacing="1" cellpadding="0">
                <tr> 
                  <td width="39" class="back1"><b class="texto4">CRN</b></td>
                  <td width="60" class="back1"><b class="texto4">Materia</b></td>
                  <td width="53" class="back1"><b class="texto4">Secci&oacute;n</b></td>
                  <td width="55" class="back1"><b class="texto4">Cr&eacute;ditos</b></td>
                  <td width="156" class="back1"><b class="texto4">T&iacute;tulo</b></td>
                  <td width="69" class="back1"><b class="texto4">Cupo</b></td>
                  <td width="57" class="back1"><b class="texto4">Inscritos</b></td>
                  <td width="77" class="back1"><b class="texto4">Disponible</b></td>
                </tr>
                <tr> 
                  <td width="39"><font class="texto4"> 
                    10110                        </font></td>
                  <td width="60"><font class="texto4"> 
                    IIND1000                        </font></td>
                  <td width="53"><font class="texto4"> 
                  <div align="center">
                    1                        </div></font></td>
                  <td width="55"><font class="texto4"> 
                    <div align="center">
                    3                       </div>
                    </font></td>
                  <td width="156"><font class="texto4"> 
                    INTROD. INGEN. INDUSTRIAL                        </font></td>
                  <td width="69"><font class="texto4"> 
                    100                        </font></td>
                  <td width="57"><font class="texto4"> 
                    100                        </font></td>
                  <td width="77"><font class="texto4"> 
                    0                        </font></td>
                </tr>
              </table>
<tr> 
            <td> 
              <table width="550" border="0" cellspacing="1" cellpadding="0">
                <tr> 
                  <td width="81" >&nbsp;</td>
                  <td width="172" class="back3" height="17"><b class="texto4">D&iacute;as</b></td>
                  <td width="171" class="back3" height="17"><b class="texto4">Horas</b></td>
                  <td width="171" class="back3" height="17"><b class="texto4">Sal&oacute;n</b></td>
                  <td width="171" class="back3"><b class="texto4">F. Inicial</b></td>
                  <td width="171" class="back3"><b class="texto4">F. Final</b></td>
                </tr>
                                    <tr> 
                  <td width="81" >&nbsp;</td>
                  <td width="172" height="17"><font class="texto4"> 
                        I                                </font></td>
                  <td width="171" height="17"><font class="texto4" > 
                    0700 - 0820                        </font></td>
                  <td width="171" height="17"><font class="texto4"> 
                    - -                        </font></td>
                  <td width="171"><font class="texto4" >28-JUL-14</font></td>
                  <td width="171"><font class="texto4" >15-NOV-14</font></td>
                </tr>
                                    <tr> 
                  <td width="81" ><div align="right"><span class="back3"><font class="texto4"><strong>Instructor(es)</strong>:</font></span></div></td>
                  <td width="172"  class="back3" height="17"><font class="texto4"><font class="texto4"> 
                    ALDANA VALDES EDUARDO                         </font></font></td>
                  <td width="171"  class="back3" height="17"><font class="texto4"> 
                                            </font></td>
                  <td width="171"  class="back3" height="17"><font class="texto4"></font></td>
                  <td width="171"  class="back3">&nbsp;</td>
                  <td width="171"  class="back3">&nbsp;</td>
                </tr>
              </table>                </td>
          </tr>

So, for instance, as soon as that XPathExpression finds code 10110 (params[1]=10110) in the first table, I need it to stop and not download the next table. I only need the text from all the children at the same level (the other cells of that row). The usual document is over 10k lines, so downloading and parsing all of it is inefficient when the searched element is at the very beginning.

Please explain your actual task and why you think that would be a solution. — Tomalak, Jul 12 '14 at 08:20
@Tomalak I added both my code so far and an example HTML doc. Any ideas? — Hugo M. Zuleta, Jul 14 '14 at 05:23
As long as you are using a DOM parser there is no way to tell it to stop in the middle of the document. That's just not how DOM parsing works. Your options are: 1) Use a SAX parser. Those can stop reading a stream in the middle, but they are geared for XML and I doubt you can find one that works on HTML. Try searching anyway. 2) Use a faster DOM parser. Maybe there's a more efficient one than Tidy? 3) If your input is sufficiently predictable you could try to use indexOf() to identify a "target area" and then parse only part of the input. That's tricky to get right, I guess. — Tomalak, Jul 14 '14 at 07:36
The server you use is rather slow. It takes (for me) 3 seconds to download the 10,000 lines of HTML. Parsing and rendering it takes far less time, a few hundred milliseconds (in Chrome). So using a fast DOM parser won't speed up the whole process by much, since a DOM parser still needs the whole document. A quick search for "Java HTML stream parser" led me to the [Jericho HTML Parser](http://jericho.htmlparser.net/docs/index.html), which claims to have *"a stream based parsing option using the `StreamedSource` class"*. This (or something like it) is your best bet. — Tomalak, Jul 14 '14 at 09:46
I will try this. Make sure to delete these comments and change them for an answer so I can upvote and give it the green check. — Hugo M. Zuleta, Jul 14 '14 at 14:25
So far all I did was think about your problem and provide you with an idea. That's almost, but not quite an answer. :) If you try Jericho and find that it actually solves your problem, share your code by answering your own question. That's far more useful for future visitors than a green tick next to a raw idea. I'll be there to upvote you. — Tomalak, Jul 14 '14 at 14:32
I decided to go with Jsoup. It's the only parser that solves this problem: http://stackoverflow.com/questions/24668436/xpath-nodes-come-after-new-line I guess I'll have to stay with inefficient code. Thanks anyway! — Hugo M. Zuleta, Jul 16 '14 at 19:51
I don't even understand what "this problem" in the other thread is. There is a bunch of code but I can't find an error description. — Tomalak, Jul 16 '14 at 21:08
There is a bunch of spaces between the td and div tags, and Jtidy can't identify the div tags. That's why I can't get the text after the div tags. — Hugo M. Zuleta, Jul 16 '14 at 21:18
That's a red herring. Spaces and newlines in the HTML source code are ignored, the problem must be something else. — Tomalak, Jul 16 '14 at 23:48
Well, I'm telling you, it's only a problem in those cases, and Jsoup doesn't have it — Hugo M. Zuleta, Aug 01 '14 at 01:46
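
For future readers, here is a minimal sketch of option 1 from the comments above (a SAX parser that stops in the middle of the stream), using only the SAX parser bundled with the JDK. It is not the code from the question and makes several assumptions: the input must already be well-formed XML with no undefined entities such as `&nbsp;` (the raw HTML shown in the question would not parse this way), rows are not nested inside other rows, and the names `EarlyStopSax`, `RowHandler`, `RowFound` and `findRow` are hypothetical, chosen only for this illustration. The handler collects the cell texts of the current `<tr>` and aborts parsing by throwing an exception as soon as a row containing the wanted code is complete.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EarlyStopSax {

    /** Hypothetical marker exception: thrown to abort parsing once the row is complete. */
    static final class RowFound extends SAXException {
        RowFound() { super("row found"); }
    }

    /** Collects the cell texts of the first <tr> that has a cell equal to the code. */
    static final class RowHandler extends DefaultHandler {
        private final String code;
        private final List<String> cells = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inCell;
        List<String> match; // stays null if no row matches

        RowHandler(String code) { this.code = code; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (qName.equals("tr")) {
                cells.clear();              // a new row starts
            } else if (qName.equals("td")) {
                inCell = true;              // start collecting this cell's text
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inCell) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (qName.equals("td")) {
                inCell = false;
                // normalize whitespace, similar in spirit to XPath normalize-space()
                cells.add(text.toString().trim().replaceAll("\\s+", " "));
            } else if (qName.equals("tr") && cells.contains(code)) {
                match = new ArrayList<>(cells);
                throw new RowFound();       // the parser processes nothing after this row
            }
        }
    }

    /** Returns the cell texts of the first matching row, or null if none matches. */
    static List<String> findRow(InputStream in, String code) {
        RowHandler handler = new RowHandler(code);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (RowFound stop) {
            // expected: parsing was aborted on purpose
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.match;
    }

    static List<String> findRow(String xml, String code) {
        return findRow(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), code);
    }

    public static void main(String[] args) {
        // The text after the first row is deliberately malformed: it is never parsed.
        String xml = "<table>"
                + "<tr><td><font> 10110 </font></td><td><font> IIND1000 </font></td></tr>"
                + "<tr><td>never parsed<table>";
        System.out.println(findRow(xml, "10110")); // prints [10110, IIND1000]
    }
}
```

With a network stream, the same `findRow(InputStream, String)` could in principle be given `url.openStream()`; closing the stream after the early exit should keep the rest of the document from being read, although the parser may already have buffered some bytes past the matching row. Whether this helps in practice depends on getting well-formed input without first running the whole page through Tidy, which is the sticking point raised in the comments.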
