2

We just released some code to make our software a little bit more user friendly, and it backfired. Basically, we're attempting to replace newlines with <br /> tags. The trouble is, sometimes our users will enter code like the following:

<a
 href='http://nowhere.com'>Nowhere</a>

When we run our code, this translates to

<a <br />href='http://nowhere.com'>Nowhere</a>

which obviously doesn't render properly.

Is there a regular expression or a PHP function to strip, or perhaps compress, the whitespace between the attributes of an HTML tag?

Clarification: This isn't full HTML. It's more similar to Markdown or some other language (we will eventually be moving to Markdown, but I need a quick fix). So I can't just parse this as regular HTML. The newlines need to be converted to <br /> tags properly.

Topher Fangio

4 Answers

3

Hmm, why are you using tools to format HTML when they're not designed for that purpose? Get yourself a DOM library.

http://simplehtmldom.sourceforge.net/

RobertPitt
  • It's not valid HTML that I'm parsing, it's text that may have HTML inside of it. So, the parts that are HTML need to be valid, but the rest is just text. – Topher Fangio Dec 07 '10 at 18:51
  • So do you have the original content without the `<br />` tags? If so, place that into the DOM parser, loop through every element, and copy its attributes into a fresh tag; the result would then be properly formatted. – RobertPitt Dec 07 '10 at 18:59
  • I'm not sure I follow. If I place text into the DOM parser, I'll lose all of the newlines that need to be converted into `<br />` tags, right? – Topher Fangio Dec 07 '10 at 19:21
  • The text you have above is the version with newlines already translated to break tags, right? If you still have the HTML with the newlines, place it into a DOM parser, loop through every element it finds, extract each tag with its attributes and values, and insert them into a fresh element created by the DOM parser. Once the DOM parser has processed the whole document, output it as HTML and you should have a clean document. Why do you need to replace newlines anyway? They're not used in web design whatsoever, so you should not need to replace them with `<br />`. – RobertPitt Dec 07 '10 at 19:51
  • These are e-mail templates and user signatures, so most of the time they will contain only text. The text needs to be converted to HTML to be displayed in an e-mail message, with line breaks where the user entered them. In addition, these templates sometimes need to contain basic HTML for links back to the site. Since the entire body isn't really HTML, I can't use a DOM parser. I found a solution I'll add shortly. – Topher Fangio Dec 07 '10 at 20:34
2

You need a library that correctly parses any HTML you throw at it; you never know what users may invent.

Look at HTML Purifier

Denis Nikolaenko
1

After some searching and much trial and error, I have come up with the following solution/hack:

/*
 * Compress all whitespace within HTML tags (including PRE at the moment)
 */
$regexp = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

preg_match_all($regexp, $text, $matches);

foreach($matches[0] as $match) {
  $new_html = preg_replace('/\s+/', ' ', $match);
  $text = str_replace($match, $new_html, $text);
}

After executing this code, every HTML tag in $text that the pattern matches has its internal whitespace collapsed to single spaces, so those tags contain NO newline characters.

I know that this isn't the best solution, but it works, and pretty soon we'll be migrating to a true markup language (such as Markdown).
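For readers who want to do this in one pass, here is a minimal sketch that applies the same tag pattern with preg_replace_callback and then converts the remaining newlines with nl2br. The function name newlines_to_br is hypothetical, and using nl2br for the final step is an assumption, since the conversion code itself isn't shown here. Note that nl2br inserts `<br />` before each newline rather than replacing it.

```php
<?php
// Hypothetical helper (not part of the original answer): collapse whitespace
// inside tags in a single pass, then convert the remaining newlines.
function newlines_to_br(string $text): string
{
    // Same tag pattern as above, written as a single-quoted string.
    $tag = '/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)\/?>/i';

    // Rewrite each matched tag once instead of searching $text repeatedly.
    $collapsed = preg_replace_callback(
        $tag,
        function (array $m): string {
            // Fall back to the untouched tag if the inner replace fails.
            return preg_replace('/\s+/', ' ', $m[0]) ?? $m[0];
        },
        $text
    );

    // preg_replace_callback returns null on error; keep the input in that case.
    // Assumption: nl2br is an acceptable way to convert the remaining newlines.
    return nl2br($collapsed ?? $text);
}

echo newlines_to_br("Hi\n<a\n href='http://nowhere.com'>Nowhere</a>\nBye");
```

Like the loop above, this only touches tags the pattern recognizes; an attribute value that itself contains a newline will not match, because `.` does not match newlines without the `s` modifier.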

Topher Fangio
  • I have up-voted the other answers that were helpful, but decided to accept my own answer since it was my actual solution for this particular problem. – Topher Fangio Dec 16 '10 at 14:00
0

Ideally, you would use an XML parser, through DOM or SAX APIs. However, if your content is not proper XML, but plain text with a few tags, the parser may fail (it depends on the tool used, I guess).

A rough solution for your particular problem may be as follows: construct a state machine with two states, inside a tag and outside a tag. You read the input character by character. Upon reading '<', switch to the "inside" state. Upon reading '>', switch to the "outside" state. Upon reading '\n' and if in the "outside" state, emit "<br />" (otherwise emit nothing).

This is just a sketch, and it may need to be refined.
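As a rough illustration, the two-state scanner could look something like the PHP below. The function name br_outside_tags is made up for this example. It also departs from the description in one respect: inside a tag it emits a single space instead of nothing, on the assumption that attributes separated only by a newline should not run together.

```php
<?php
// Hypothetical helper sketching the two-state scanner described above.
function br_outside_tags(string $text): string
{
    $out = '';
    $insideTag = false; // the two states: inside a tag or outside a tag
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];

        if ($ch === '<') {
            $insideTag = true;   // switch to the "inside" state
            $out .= $ch;
        } elseif ($ch === '>') {
            $insideTag = false;  // switch back to the "outside" state
            $out .= $ch;
        } elseif ($ch === "\n") {
            // Outside a tag a newline becomes a break. Inside a tag this
            // sketch emits a space (an assumption, see the note above).
            $out .= $insideTag ? ' ' : '<br />';
        } else {
            $out .= $ch;
        }
    }

    return $out;
}

echo br_outside_tags("Hi\n<a\n href='http://nowhere.com'>Nowhere</a>\nBye");
```

Two refinements it would still need: a literal `<` in plain text (as in "1 < 2") flips the state until the next `>`, and Windows line endings leave a stray "\r" behind.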

ChrisJ