If I have HTML like this: <dsometext<f<legit></legit> whatever

What regex pattern do I use to switch < to &lt; before d and f?

I think it's every < that is not followed by a >, but I can't wrap my head around the regex for that. I have users typing HTML, and I'm using jQuery to wrap the HTML and parse the nodes. However, bad interim markup blows it up, so I want to swap out the <.

Ideas?

Edit

I'm not trying to parse the HTML to valid HTML. I just want to knock out interim characters as users type and the HTML is updated on page. If they are typing <strong>, and are still at the < and I try to put the HTML on the page, it will cause horrible markup. That's why I need to swap it out.

Answer

I chose @pimvdb's answer because it correctly answers the question I asked.

However, to make the world happier, I found a much simpler way of doing things without using any regex. Basically, I originally had an issue where [title] was in place of an element and had no container element guaranteed to contain just the title. Therefore, changing the innerHTML of anything would cause horrors. We simply added the wrapping element. The hesitation to do that, and the cause of this thread, was due to some crazy reasons specific to the app and backwards compatibility for our users.

Dave Stein
  • Using regex to parse HTML is a nightmare; I'd suggest doing it in a way that lets you use the DOM, which is much better for this sort of thing. – Loktar Dec 15 '11 at 17:49
  • There's no easy way to work out which ones to change to HTML entities. I would suggest giving an error back to the user if they give you invalid HTML and let them sort it out. – a'r Dec 15 '11 at 17:49
  • obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Donald Miner Dec 15 '11 at 17:51
  • You should try to validate the input and reject it if not correct, I may be wrong but it looks very complicated to autocorrect some html – Guillaume86 Dec 15 '11 at 17:51
  • @orangeoctopus I was looking after this one, thanks – Guillaume86 Dec 15 '11 at 17:52
  • I would suggest running your html through htmlagilitypack, ensuring that you filter all but a whitelist of elements. Regex ain't an HTML parser. – spender Dec 15 '11 at 17:53
  • @orangeoctopus I read that one and agree actually. It's just that as the user is typing, their HTML is getting put into the DOM so they are going to have bad markup in between. When that bad markup is put onto the page, it parses wrong and I'm left with bad attributes in random nodes. – Dave Stein Dec 15 '11 at 17:54
  • @spender I don't care if it's valid in the sense of a correct tag, I just want to make sure interim HTML while the user types will not be parsed as a tag. Does that make sense? I typed a bit more about it in my other comment. – Dave Stein Dec 15 '11 at 17:54
  • How about using an iframe for the rendered "live" html? That will isolate your page from any garbage. – spender Dec 15 '11 at 17:55
  • @spender I need them to see how their page will look as they type. At that same moment I need to know what's typed in case I need to strip anything out... ie [title] becomes something from json data. I know this sounds like horror, and it kinda is, but I'm unsure of a better way. – Dave Stein Dec 15 '11 at 17:58

2 Answers

It's not good practice to parse HTML with regexps, but this will do fine for your sample:

"<dsometext<f<legit></legit> whatever".replace(/(?!<[^<>]+>)</g, "&lt;");

The negative lookahead (?!<[^<>]+>) ensures that a < is replaced only when it does not begin a complete <...> sequence, that is, a <, one or more characters other than < and >, and then a >.
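
As a rough sketch of how the pattern above behaves on partial input, the snippet below wraps it in a function. The name `escapeStrayOpeners` is made up for illustration and is not part of jQuery or any library. The sketch only checks whether each < starts a complete <...> run; it does not check that an opening tag has a matching closing tag.

```javascript
// Hypothetical helper (name invented for this sketch); the regex is the
// one from the answer, unchanged.
function escapeStrayOpeners(html) {
  // Replace each "<" that does not begin a "<", one or more characters
  // other than "<" or ">", then ">" sequence.
  return html.replace(/(?!<[^<>]+>)</g, "&lt;");
}

// The sample from the question: only the two stray "<" are escaped.
console.log(escapeStrayOpeners("<dsometext<f<legit></legit> whatever"));
// → &lt;dsometext&lt;f<legit></legit> whatever

// A tag the user is still typing has no ">" yet, so its "<" is escaped.
console.log(escapeStrayOpeners("<stro"));
// → &lt;stro

// A complete opening tag is left alone even if it is never closed.
console.log(escapeStrayOpeners("<strong>bold"));
// → <strong>bold
```

The last case shows why this is not a validator: an unclosed <strong> still reaches the DOM as a tag.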

pimvdb
  • Thanks that works. But now I realize I run into the same issue if someone types but not . If I have that extra bit accounted for, I'm set. So I'm not validating HTML itself, just that opening "tags" of any sort are accounted for. I don't care if they type , just that it won't be processed if there's no closing . Luckily this runs on a tidbit of HTML and not the entire page. – Dave Stein Dec 15 '11 at 18:06
  • @Dave Stein: That's becoming a bit complex since you'd need to save the tag name and check where the tag name ends. (And, what about wrong nesting like ``?) I'm not sure this is even possible with regexp... Perhaps a decent parser is rather the way to go. – pimvdb Dec 15 '11 at 18:08
  • yeah before reading your comment I was wondering same thing. The nesting issue breaks down everything. I'm going to see what I can do – Dave Stein Dec 15 '11 at 18:14

Parsing HTML or XML this way is not recommended, but for this sample it can be done with the string replace method alone:

"<dsometext<f<legit></legit>".replace("<d","&lt;d").replace("<f","&lt;f")
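
One caveat worth illustrating: when `replace` is given a string pattern, it replaces only the first occurrence. The sketch below shows that on the sample and on a made-up input with a repeated `<d`; the variable names are illustrative only, and the global regex is one possible alternative rather than a recommendation.

```javascript
const sample = "<dsometext<f<legit></legit>";

// Chained string replaces: each call changes only the first match.
const chained = sample.replace("<d", "&lt;d").replace("<f", "&lt;f");
console.log(chained);
// → &lt;dsometext&lt;f<legit></legit>

// Made-up input with "<d" twice: the second one is left untouched.
const repeated = "<d one <d two";
const firstOnly = repeated.replace("<d", "&lt;d");
console.log(firstOnly);
// → &lt;d one <d two

// A global regex replaces every occurrence. Note that it would also hit
// real tags starting with "d", such as <div>.
const everyMatch = repeated.replace(/<d/g, "&lt;d");
console.log(everyMatch);
// → &lt;d one &lt;d two
```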
dku.rajkumar