Which HTML Parser (preferably PHP) supports getting the offset of the current node in the input string?

Question

I'm looking for an HTML or XML parser that lets one access the offset/position of the current element in the input string or file.

For example if walking through this string:

<div>
    <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>
    <p>sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
</div>

I'm looking for a way to get the starting position (including whitespace) of each <p> tag, here: 7 and 72.

It'd be great if a PHP parser supported that natively (I've looked at DOM, XMLReader, and other libraries mentioned in this SO question but haven't found a way to do it), but otherwise any language/framework would be fine.

Note: Related to this question, but less localized.

not exactly the same but: http://stackoverflow.com/questions/2530679/using-phps-xmlreader-how-do-i-get-the-line-number-of-the-current-node — Gordon, Jan 23 '13 at 11:52
@Gordon What concerns me is that [DOMNode::getLineNo](http://php.net/manual/en/domnode.getlineno.php) seems to be pretty unreliable. If it's an underlying libxml2 bug as is asserted on that page, I'd probably need to find a non-libxml2-based solution. The other thing being that I would need the offset on the current line, not just the line number. — julien_c, Jan 23 '13 at 12:05
I am curious why you would need that anyway. The point of a parser is to parse the serialized XML into a data structure of some sort, which you then modify and serialize back to XML. The information where in the original XML string a node is located seems irrelevant then. At least I don't see the UseCase. — Gordon, Jan 23 '13 at 12:27
I'm building an EPUB reading system where "sentences" (sometimes spanning multiple XML nodes) are highlighted and their position is stored as start and end characters' offsets in the HTML file. — julien_c, Jan 23 '13 at 12:33
I wrote an html parser for pascal that tracks the offset. Guess it will not help you much, although it also reads most xml files... — BeniBela, Jan 23 '13 at 12:37

score 6 · Accepted Answer · answered Jan 23 '13 at 21:33

Maybe you could use the Generic XML parser class (also on GitHub).
According to the author's description:

Parses arbitrary XML input and builds an array with the structure of all tag and data elements.
It can validate and extract data from a whole XML document with just a single call. It supports validating common tag value data types and can perform custom validations using a subclass.
Optionally, keeps track of the positions of each element to allow the determination of the exact location of elements that may be contextually in error.
Supports parsed file cache to minimize the overhead of parsing the same file repeatedly.
Optimized parsing of simplified XML (SML) formats ignoring the tag attributes.
Validate and extract data from a whole XML document with single function call

I've tested it with this code:

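<?php

require 'xml_parser.php';

$file_name = 'test.xml';

// The third argument (1) tells the parser to store element positions;
// the fourth is the cache file for the parsed structure.
$error = XMLParseFile($parser, $file_name, 1, $file_name.'.cache');
if (strcmp($error, '')) {
    die("Parser error: $error\n");
}

// Print the stored position of every <p> element (case-insensitive match).
foreach ($parser->structure as $key => $val) {
    if (is_array($val) && isset($val['Tag']) && !strcasecmp($val['Tag'], 'p')) {
        print_r($parser->positions[$key]);
    }
}

?>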

The test.xml file contains your sample HTML snippet.
By running the script from the command line I get this output:

Array
(
    [Line] => 2
    [Column] => 7
    [Byte] => 12
)
Array
(
    [Line] => 3
    [Column] => 7
    [Byte] => 80
)

So, the Byte field is probably what you're looking for.
To better understand how it works, also have a look at its source code.

Thanks for your answer. I'm a bit concerned by the fact that the library seems a bit obscure – I'll keep on looking for now. — julien_c, Jan 24 '13 at 20:00
Do you know if the library's still maintained? Any other suggestions, maybe for languages/etc. ? — julien_c, Jan 29 '13 at 12:29
@julien_c The last documentation change is dated 2012-09-05, so I suppose that the library is still maintained. The library uses the [PHP Expat](http://php.net/manual/en/ref.xml.php) parser functions. For example, have a look at the [xml_get_current_byte_index](http://www.php.net/manual/en/function.xml-get-current-byte-index.php) function. — , Jan 29 '13 at 20:43
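
For reference, the Expat function named in the comment above can also be called directly, without the wrapper class. The following is only a sketch under assumptions: tag_byte_indexes is a made-up helper name, and PHP's xml extension is assumed to be installed. It records the value of xml_get_current_byte_index each time a start tag is handled. Where that index falls relative to the opening "<" should be checked against input you know. If the test file in the answer was indented with four spaces as shown, the first <p> begins at offset 10 while the reported Byte is 12, so some adjustment may be needed.

```php
<?php
// Sketch: collect the byte index the parser reports while handling each
// start tag. tag_byte_indexes() is a hypothetical helper, not a library API.
function tag_byte_indexes(string $xml): array
{
    $found = [];
    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        // Start-tag handler: remember the tag name and the current byte index.
        function ($p, $name, $attrs) use (&$found) {
            $found[] = ['tag' => $name, 'byte' => xml_get_current_byte_index($p)];
        },
        // End-tag handler: nothing to record in this sketch.
        function ($p, $name) {
        }
    );
    // Parse the whole string in one call (third argument marks the final chunk).
    if (xml_parse($parser, $xml, true) !== 1) {
        $message = xml_error_string(xml_get_error_code($parser));
        throw new RuntimeException("XML parse error: $message");
    }
    return $found;
}

// Usage: dump what the parser reports for the sample from the question.
$sample = "<div>\n    <p>Lorem ipsum</p>\n    <p>sed do eiusmod</p>\n</div>";
print_r(tag_byte_indexes($sample));
```

Because this goes through a real XML parser, it only works on well-formed input; typical tag-soup HTML will raise the parse error instead.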

score 0 · Answer 2 · answered Jan 23 '13 at 11:52

If you do not mind coding in Java (a PHP solution follows the Java code), you can use the indexOf method of the String class to get the offset of a given token.

Here is an example:

class Index {
    public static void main(String[] args) {
        String token = "<p>";
        String input = "<p> hola </p> <p> adios </p>";
        int beginIdx = -1;
        // indexOf returns -1 once there are no further occurrences.
        while ((beginIdx = input.indexOf(token, beginIdx + 1)) != -1) {
            System.out.println("Token at: " + beginIdx);
        }
    }
}

And the output is:

Token at: 0
Token at: 14

In PHP the equivalent function is strpos (strrpos, by contrast, returns only the last occurrence):

int strpos ( string $haystack , string $needle [, int $offset = 0 ] )

It returns false rather than -1 when the needle is not found. The manual page has some examples: http://php.net/manual/en/function.strpos.php
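
A rough PHP counterpart of the Java loop above might look like the sketch below. It uses strpos, which searches forward from an offset much like indexOf. find_all is a made-up helper name, and the input string is the one from the Java example with its closing tag corrected.

```php
<?php
// Return the 0-based byte offsets of every occurrence of $token in $input.
// find_all() is a hypothetical helper name used only for this sketch.
function find_all(string $input, string $token): array
{
    $offsets = [];
    if ($token === '') {
        return $offsets; // an empty token has no meaningful position
    }
    $pos = 0;
    // strpos returns false (not -1) when there are no more matches.
    while (($pos = strpos($input, $token, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += 1; // resume the search one byte past the start of this match
    }
    return $offsets;
}

foreach (find_all("<p> hola </p> <p> adios </p>", "<p>") as $offset) {
    echo "Token at: $offset\n";
}
```

The comparison has to be !== false, because a match at offset 0 is also falsy in PHP.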

answered Jan 23 '13 at 11:52

arutaku


Not what the OP is looking for. This is not using an XML/HTML parser and will fail for any P elements not written exactly as `<p>`, e.g. having attributes or uppercase. – Gordon Jan 23 '13 at 11:57
Then use a regular expression instead of a fixed string – arutaku Jan 23 '13 at 12:39
I doubt he will find a parser that cares about the location in the input string, because the whole intent of parsers is to remove those sorts of concerns. – Rimu Atkinson Jan 24 '13 at 04:54
Use stripos instead of strrpos because stripos is case insensitive, and just search for "<p" – Rimu Atkinson Jan 24 '13 at 04:55
@RimuAtkinson I'm not just looking for <p> tags though (all kinds of tags) – julien_c Jan 24 '13 at 20:01
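
The regular-expression idea raised in these comments could be sketched as follows, using preg_match_all with the PREG_OFFSET_CAPTURE flag so that each match carries its byte offset. opening_tag_offsets is a made-up helper name, and the pattern is a deliberately rough assumption: it matches any opening tag regardless of case or attributes, but it is not a parser and will be fooled by comments, CDATA sections, or a ">" inside an attribute value.

```php
<?php
// List every opening tag with its 0-based byte offset in $html.
// opening_tag_offsets() is a hypothetical helper; the pattern is only an
// approximation of HTML syntax, not a replacement for a real parser.
function opening_tag_offsets(string $html): array
{
    $result = [];
    // "<", a tag name, then anything up to the next ">"; closing tags
    // ("</...") do not match because a letter must follow the "<".
    $pattern = '/<([a-z][a-z0-9]*)\b[^>]*>/i';
    preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    foreach ($matches[0] as $i => $whole) {
        $result[] = [
            'tag'    => strtolower($matches[1][$i][0]),
            'offset' => $whole[1],
        ];
    }
    return $result;
}

// Usage: mixed case and attributes are both handled by the pattern.
$html = "<div>\n    <P class=\"x\">a</P>\n    <p>b</p>\n</div>";
foreach (opening_tag_offsets($html) as $hit) {
    echo $hit['tag'], " at ", $hit['offset'], "\n";
}
```

The offsets are byte offsets, so for multi-byte encodings such as UTF-8 they differ from character offsets whenever non-ASCII text precedes the tag.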
