4

I'm looking for an HTML or XML parser that lets one access the offset/position of the current element in the input string or file.

For example if walking through this string:

<div>
    <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>
    <p>sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
</div>

I'm looking for a way to get the starting position (including whitespace) of each <p> tag, here: 7 and 72.

It'd be great if a PHP parser supported that natively (I've looked at DOM, XMLReader, and other libraries mentioned in this SO question but haven't found a way to do it), but otherwise any language/framework would be fine.

Note: Related to this question, but less localized.

julien_c
  • not exactly the same but: http://stackoverflow.com/questions/2530679/using-phps-xmlreader-how-do-i-get-the-line-number-of-the-current-node – Gordon Jan 23 '13 at 11:52
  • @Gordon What concerns me is that [DOMNode::getLineNo](http://php.net/manual/en/domnode.getlineno.php) seems to be pretty unreliable. If it's an underlying libxml2 bug as is asserted on that page, I'd probably need to find a non-libxml2-based solution. The other thing being that I would need the offset on the current line, not just the line number. – julien_c Jan 23 '13 at 12:05
  • I am curious why you would need that anyway. The point of a parser is to parse the serialized XML into a data structure of some sort, which you then modify and serialize back to XML. The information where in the original XML string a node is located seems irrelevant then. At least I don't see the UseCase. – Gordon Jan 23 '13 at 12:27
  • I'm building an EPUB reading system where "sentences" (sometimes spanning multiple XML nodes) are highlighted and their position is stored as start and end characters' offsets in the HTML file. – julien_c Jan 23 '13 at 12:33
  • I wrote an html parser for pascal that tracks the offset. Guess it will not help you much, although it also reads most xml files... – BeniBela Jan 23 '13 at 12:37
  • What are you working on? – oxygen Jan 28 '13 at 18:46
  • Is '7' the div (5) + new line (1?) + tab (1?) ? – Chris Jan 29 '13 at 10:54
  • @Chris Yes, I guess (number of characters) – julien_c Jan 29 '13 at 11:31

2 Answers

6

Maybe you could use the Generic XML parser class (also on GitHub).
According to the author's description:

  • Parses arbitrary XML input and builds an array with the structure of all tag and data elements.
  • It can validate and extract data from a whole XML document with just a single call. It supports validating common tag value data types and can perform custom validations using a subclass.
  • Optionally, keeps track of the positions of each element to allow the determination of the exact location of elements that may be contextually in error.
  • Supports parsed file cache to minimize the overhead of parsing the same file repeatedly.
  • Optimized parsing of simplified XML (SML) formats ignoring the tag attributes.
  • Validate and extract data from a whole XML document with single function call

I've tested it with this code:

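<?php

require('xml_parser.php');

$file_name = 'test.xml';

// The third argument (1) asks the parser to record element positions;
// the fourth is the path of the parsed-file cache.
$error = XMLParseFile($parser, $file_name, 1, $file_name.'.cache');

// XMLParseFile returns a non-empty error message on failure.
if (strlen($error)) {
    die("Parse error: $error\n");
}

// Print the recorded position of every <p> element (case-insensitive match).
foreach ($parser->structure as $key => $val) {
    if (is_array($val) && isset($val['Tag']) && !strcasecmp($val['Tag'], 'p')) {
        print_r($parser->positions[$key]);
    }
}

?>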

The test.xml file contains your sample HTML snippet.
By running the script from the command line I get this output:

Array
(
    [Line] => 2
    [Column] => 7
    [Byte] => 12
)
Array
(
    [Line] => 3
    [Column] => 7
    [Byte] => 80
)

So, the Byte field is probably what you're looking for.
For a better understanding of how it works, also have a look at its source code.
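Since any language would be fine: Expat itself tracks a byte offset while parsing, and Python's standard `xml.parsers.expat` module exposes it as the parser's `CurrentByteIndex` attribute. Below is a minimal sketch, not a drop-in solution. The helper name `element_offsets` is made up for illustration, and the sample assumes the snippet is indented with tabs (which is what gives 7 and 72).

```python
import xml.parsers.expat


def element_offsets(xml_bytes):
    """Return (tag, byte_offset) pairs for every start tag, in document order."""
    offsets = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # Inside the start-element callback, CurrentByteIndex is the byte
        # offset of the "<" that opens the current tag.
        offsets.append((name, parser.CurrentByteIndex))

    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return offsets


# The question's sample, assuming tab indentation (an assumption, see above).
sample = (
    b"<div>\n"
    b"\t<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>\n"
    b"\t<p>sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>\n"
    b"</div>"
)

for tag, offset in element_offsets(sample):
    print(tag, offset)
```

Two caveats: the offsets are in bytes, so they differ from character offsets as soon as the file contains non-ASCII text, and Expat only accepts well-formed XML (fine for XHTML in an EPUB, not for tag-soup HTML).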

  • Thanks for your answer. I'm a bit concerned by the fact that the library seems a bit obscure – I'll keep on looking for now. – julien_c Jan 24 '13 at 20:00
  • Do you know if the library's still maintained? Any other suggestions, maybe for languages/etc. ? – julien_c Jan 29 '13 at 12:29
  • @julien_c The last documentation change is dated 2012-09-05, so I suppose that the library is still maintained. The library uses the [PHP Expat](http://php.net/manual/en/ref.xml.php) parser functions. For example, have a look at the [xml_get_current_byte_index](http://www.php.net/manual/en/function.xml-get-current-byte-index.php) function. –  Jan 29 '13 at 20:43
0

If you do not mind coding in Java (a PHP solution follows the Java code), you can use the indexOf method of the String class to get the offset of a given token.

Here is an example:

class Index {
    public static void main ( String [] args )
    {
        String token = "<p>";
        String input = "<p> hola </p> <p> adios </p>";
        // Start at -1 so that the first search begins at index 0.
        int beginIdx = -1;
        // indexOf returns -1 once there are no further occurrences.
        while ( (beginIdx = input.indexOf( token, beginIdx + 1 )) != -1 ) {
            System.out.println( "Token at: " + beginIdx );
        }
    }
}

And the output is:

Token at: 0
Token at: 14

In PHP the closest equivalent is strpos (strrpos, by contrast, returns the last occurrence):

int strpos ( string $haystack , mixed $needle [, int $offset = 0 ] )

You can have a quick look at the manual page for it (it has some examples): http://php.net/manual/en/function.strpos.php
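A fixed-string search only finds tags spelled exactly like the token, so it misses tags with attributes or in uppercase. A regular expression loosens that. The following is a rough sketch in Python, with a hypothetical helper name `open_tag_offsets`. It is still plain text scanning, not parsing: a tag inside a comment or CDATA section, or a ">" inside an attribute value, will fool it.

```python
import re

# Matches an opening tag such as <p>, <P class="x"> or <br/>.
# Closing tags and comments are skipped because a letter must follow "<".
OPEN_TAG = re.compile(r"<([A-Za-z][\w:.-]*)[^<>]*>")


def open_tag_offsets(text):
    """Return (lowercased tag name, character offset) for each opening tag."""
    return [(m.group(1).lower(), m.start()) for m in OPEN_TAG.finditer(text)]


print(open_tag_offsets('<p> hola </p> <P class="x"> adios </P>'))
```

Unlike the Expat-style byte index, `m.start()` is a character offset into the string being scanned.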

arutaku
  • Not what the OP is looking for. This is not using an XML/HTML parser and will fail for any P elements not written exactly as `<p>`, e.g. having attributes or uppercase. – Gordon Jan 23 '13 at 11:57
  • Then use a regular expression instead of a fixed string – arutaku Jan 23 '13 at 12:39
  • I doubt he will find a parser that cares about the location in the input string, because the whole intent of parsers is to remove those sorts of concerns. – Rimu Atkinson Jan 24 '13 at 04:54
  • Use stripos instead of strrpos because stripos is case insensitive, and just search for "<p>" – Rimu Atkinson Jan 24 '13 at 04:55
  • @RimuAtkinson I'm not just looking for <p> tags though (all kinds of tags) – julien_c Jan 24 '13 at 20:01