
Let's say I have something like this: `<data>some 'text'</data>`. Expat has no problem parsing this.
Now if I have this: `<data>'<some text>'</data>`, it freaks out about a mismatched tag, which is due to the `<` being found.

Unfortunately I can't just escape every < and >, because then there would be no start tags left and the result still wouldn't be well-formed. Is there a simple way to get around this? The only way I can think of is a regular expression that escapes < and > when they are found within a quote.

EDIT: The actual portion that breaks it:

    <script type='text/javascript'>
    (function() {
        var useSSL = 'https:' == document.location.protocol;
        var src = (useSSL ? 'https:' : 'http:') +
            '//www.googletagservices.com/tag/js/gpt.js';
        document.write('<scr' + 'ipt src="' + src + '"></scr' + 'ipt>');
    })();
    </script>
Chrispresso
  • You've got broken xml. You should fix the xml rather than trying to break the parsers. `<data>'<some text>'</data>` is not valid, because `<some text>` will appear as a singleton tag `some` with an invalid attribute `text`. – Marc B Nov 24 '14 at 20:43
  • As much as I would love to change the xml I can't. I'm parsing xhtml from a website and a hacked up `document.write(...)` is what breaks it. – Chrispresso Nov 24 '14 at 20:45
  • @ZWiki. That cannot be valid xhtml, unless the `document.write(...)` is inside a cdata section. – ekhumoro Nov 24 '14 at 20:50
  • @ekhumoro, added the section that breaks it. It is not inside a cdata section; even just trying to parse that alone breaks it – Chrispresso Nov 24 '14 at 20:54
  • Angle brackets within a quote won't work, because it's perfectly valid to have, e.g., `Here's some '<div id='quoted'>quoted</div>' text`. – abarnert Nov 24 '14 at 20:59
  • @ZWiki. That sample is not valid xml (or xhtml). The script block would need to be enclosed in a cdata section, like this: `<script type='text/javascript'><![CDATA[ ... ]]></script>`. – ekhumoro Nov 24 '14 at 21:01
  • Also, what are you trying to _do_ with this page after parsing it? – abarnert Nov 24 '14 at 21:15

1 Answer


Assuming your bad (X)HTML is all consistent with this example, the rule seems pretty obvious: you want to treat `script` tags as if their contents were cdata. That isn't valid XML, but it gives you a relatively simple transformation you can write and apply to your page before parsing it. You could cdata-fy the script body, quote the angle brackets within it, or whatever else you find appropriate. Then you'll have valid markup (or maybe just the next error to deal with) that you can successfully parse. (Without knowing what you're trying to do with the data beyond parsing, nobody can suggest anything much more specific.)
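For example, here's a minimal sketch of the cdata-fy approach. The regex is purely illustrative (it assumes no nested, commented-out, or mismatched script tags) and the sample markup is made up to mirror the question:

```python
import re
import xml.etree.ElementTree as ET

def cdata_wrap_scripts(markup):
    # Wrap each <script>...</script> body in a CDATA section so the
    # angle brackets inside it no longer look like markup to the parser.
    return re.sub(
        r"(<script\b[^>]*>)(.*?)(</script>)",
        lambda m: m.group(1) + "<![CDATA[" + m.group(2) + "]]>" + m.group(3),
        markup,
        flags=re.DOTALL | re.IGNORECASE,
    )

broken = """<html><body><script type='text/javascript'>
document.write('<scr' + 'ipt src="x.js"></scr' + 'ipt>');
</script></body></html>"""

fixed = cdata_wrap_scripts(broken)
tree = ET.fromstring(fixed)      # raises ParseError on `broken` itself
script = tree.find(".//script")
```

After the transformation, the brackets inside the script body are plain character data, so a strict parser like `xml.etree.ElementTree` accepts the document and `script.text` holds the original JavaScript.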


The rule you suggested, "making a regular expression to escape < and > if they are found within a quote", is clearly not going to work. Consider how it would treat these two fragments, which differ only by the apostrophe in "Here's":

    <div id='normal'>Here is some '<div id='quoted'>quoted</div>' text</div>
    <div id='normal'>Here's some '<div id='quoted'>quoted</div>' text</div>

That single apostrophe changes which characters count as "within a quote", so the regex would escape entirely different brackets in the two lines. And that's besides the fact that, even if the rule were unambiguous, the language you're describing still wouldn't be a regular language.
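To see the ambiguity concretely, here's a naive, purely illustrative implementation of that quote rule, pairing up apostrophes and escaping brackets between them:

```python
import re

def escape_in_quotes(markup):
    # Naive version of the proposed rule: treat each pair of apostrophes
    # as a quoted region and escape any angle brackets inside it.
    def fix(m):
        return "'" + m.group(1).replace("<", "&lt;").replace(">", "&gt;") + "'"
    return re.sub(r"'([^']*)'", fix, markup)

a = "<div id='normal'>Here is some '<div id='quoted'>quoted</div>' text</div>"
b = "<div id='normal'>Here's some '<div id='quoted'>quoted</div>' text</div>"
```

On the first fragment it escapes the inner div's brackets; on the second, the apostrophe in "Here's" shifts every quote pairing and nothing gets escaped at all.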


Also, it's worth asking whether this is actually XML in the first place. If it's XHTML, it has additional problems; for example, `document.write` does not exist in the XHTML DOM. It might be the XML serialization of HTML5, but it might just be HTML5 or HTML 4.01, in which case you shouldn't be trying to parse it as XML at all.


You may also want to consider using a more liberal parser. Trying beautifulsoup4 with each of the parsers it knows how to use (`lxml` in both HTML and XML modes, `html.parser`, and `html5lib`) until you find one that works consistently can be a good quick&dirty solution to broken markup.
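As a minimal stdlib illustration of the strict-vs-liberal difference (the sample page is made up; beautifulsoup4's built-in backend wraps this same `html.parser` module):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

page = """<html><body><script type='text/javascript'>
document.write('<scr' + 'ipt src="x.js"></scr' + 'ipt>');
</script></body></html>"""

# A strict XML parser chokes on the bare angle brackets in the script body.
try:
    ET.fromstring(page)
    strict_ok = True
except ET.ParseError:
    strict_ok = False

# html.parser treats <script> contents as CDATA, so it parses the same page.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

p = TagCollector()
p.feed(page)
```

Here `strict_ok` ends up `False`, while the lenient parser happily reports the `html`, `body`, and `script` start tags.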

abarnert
  • Pretty sure it's XHTML based off this: ``. I don't know much about websites so I didn't figure all that stuff mattered, I was just trying to parse it into a quick tree structure – Chrispresso Nov 24 '14 at 21:07
  • @ZWiki: First, put that in the question so we don't have to guess. Second, try running an online XHTML validator against the page before just assuming that you should be able to parse it. (Although a validator is unlikely to catch DOM errors in the JS…) – abarnert Nov 24 '14 at 21:10
  • @ZWiki. But it's not **valid** xhtml, which is why you can't parse it. If you want to quickly parse junk pages like this, just use something like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/). – ekhumoro Nov 24 '14 at 21:12
  • @abarnert. I just tried the example code in the question with BeautifulSoup, and it works straight out of the box. – ekhumoro Nov 24 '14 at 21:20
  • Thanks for the explanation. As much as I don't want to use a 3rd party library for something so simple, it'll work until I find a better way to do it :) – Chrispresso Nov 24 '14 at 21:25
  • @ekhumoro: If by "straight out of the box" you mean "on a clean system with no third-party stuff installed besides bs4", then it's using `html.parser` (or `HTMLParser` if Python 2.x). IIRC, `html.parser` actually has an optional "script and style are cdata" hack, which bs4 turns on… but anyway, the general idea of "try a non-strict parser" is probably more important than "use this specific parser with this specific configuration". – abarnert Nov 24 '14 at 21:38
  • @ZWiki: Unfortunately, parsing broken XHTML isn't really as simple as it appears. (In fact, it was kind of the whole point of both HTML 4 and XHTML 1 to eliminate the need to parse broken HTML, because that was a significant part of the complexity, and bugginess, of every browser out there. But of course the browsers bowed to the reality and started handling broken code once enough people started deploying broken code…) – abarnert Nov 24 '14 at 21:41
  • @abarnert, yea I never realized how much random broken stuff there is in websites. Originally I just had a regex pull the `` I needed and it worked, but now I'm using `BeautifulSoup`. I tried `HTMLParser` but I hadn't seen an option for `cdata`. Maybe I just wasn't looking hard enough. – Chrispresso Nov 24 '14 at 21:46
  • @ZWiki: IIRC, there's a documented `set_cdata_mode` function that, if you subclass the parser, you can call on each tag you want to treat as cdata before parsing the tag body, but there's also an undocumented attribute named something like `cdata_content_tags` that can be set per-parser or via a module-wide default, which is a list of tag names that will automatically `set_cdata_mode`. You'd have to read the source for your Python version to see the details. – abarnert Nov 24 '14 at 22:12
  • @abarnert. By "out of the box", I just meant that a simple two-line script will very likely get the desired results straight away (unless you're using an oldish version of python, perhaps). I don't think using `HTMLParser` alone is a good choice, though. You'd have to be certain that cdata junk will be the only problem ever encountered in the source markup, which seems an iffy assumption to make. – ekhumoro Nov 24 '14 at 23:01
  • @ekhumoro: The point is that what bs4 does with that simple two-line script depends entirely on what other packages you've installed. So, if you test it, and it works on your system, it may not work on my system just because, say, you had `lxml` installed but I didn't, or you didn't have anything extra installed but I had `html5parser`. That's why it's important to know about how bs4 selects underlying parsers. As for using `html.parser` alone, I agree; there's a reason I left it out of the answer. (Also, using it to parse XML is sketchy to begin with…) – abarnert Nov 24 '14 at 23:17
  • @abarnert. If bs4 automatically changes the default backend, then I would say that that is misfeature (or even a bug) - so you're certainly right to point that out. However, it's easy enough to explicitly set the backend (to `html.parser`, say), so it shouldn't be too much of an issue once you're aware of that. The python version is a more significant problem for bs4, I think. – ekhumoro Nov 25 '14 at 00:18
  • @ekhumoro: As the docs (linked in the answer) say: "If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser." The reason for this is not just that `html.parser`/`HTMLParser` kind of sucks, but that it's been known to change within minor versions of Python. (IIRC, a bugfix for 2.7.5 caused it to reject some invalid HTML that 2.7.4 usually accepted but sometimes crashed on.) – abarnert Nov 25 '14 at 00:29
  • @ekhumoro: As for Python version, you're right, 2.x vs. 3.0-3.2 vs. 3.3+ definitely makes a difference, and, as I just explained, even bugfix versions can make a difference when using bs4 with the built-in parser. – abarnert Nov 25 '14 at 00:31
  • @abarnert. Whatever the docs say, it's still a misfeature :) It does indeed suck that the python html parser has not been more stable, but that's not a good reason for giving up on it (it's always worked fine for me, but I'm not a power user, and of course, YMMV). – ekhumoro Nov 25 '14 at 00:47