I've got nonstandard characters coming out of my database (due to line breaks).

My HTML validator is complaining about them.

Since my HTML validator is a direct extension of my ego, I'd like to keep the thing happy and green-ok-arrow-y.

Does someone who's done this before have a quick fix?

BTW I don't want to change the page's charset, doctype, or the data. Just looking for a utf8_decode() type thing that would clean up the string, but utf8_encode() and utf8_decode() don't work...

UPDATE

Sorry, "non-standard characters" is a bit vague, but then so is this error warning. Specifically, they're not SGML characters, which apparently don't fit the SGML parser...but now I get into the fuzzy territory, not sure what's going on.

Ben
  • What exactly are "nonstandard characters"? – deceze Jul 12 '11 at 03:11
  • Can you tell us exactly what the "non-standard" characters are? The set of legal characters in XML is here: http://www.w3.org/TR/xml/#charsets -- are you trying to validate as XHTML? – Ray Toal Jul 12 '11 at 03:12
  • @Ray Toal - They're line breaks from HeidiSQL. The error says: "non SGML character number 30". Originally, they're line breaks in a text area, which is sent to HeidiSQL and stored. The problem starts when the values are returned from HeidiSQL as weird linebreak things. – Ben Jul 12 '11 at 03:16
  • Edited my answer to show how to deal with that character (U+001E) – Ray Toal Jul 12 '11 at 03:19

1 Answer

If by non-standard characters you mean the XHTML validator sees characters in your document that are not permitted by the XML specification (see http://www.w3.org/TR/xml/#charsets), then your solution is to escape or remove them. For example, if you have the illegal character U+0004, you can turn it into a numeric character reference in PHP before writing it out. Be aware, though, that XML 1.0 does not allow character references to most control characters either, so replacing or stripping them is usually the safer fix.

If by non-standard characters you mean your byte sequence is so whacked that it is not a legal byte sequence of UTF-8 (i.e., it cannot be decoded), then you have a logic error in your application. Perhaps you are reading bytes instead of asking PHP to read characters and encode them properly.
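To tell the two cases apart, a quick check along these lines may help. It is only a sketch: `is_valid_utf8` is a hypothetical helper name, and it relies on PCRE's `u` modifier rejecting subjects that are not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true if $bytes is a well-formed UTF-8 sequence.
// With the /u modifier, preg_match() returns false (not 0) when the
// subject string contains invalid UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// U+001E is a single valid UTF-8 byte, so the string decodes fine;
// it is merely a character that XML does not permit.
var_dump(is_valid_utf8("line one\x1Eline two")); // bool(true)

// 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
var_dump(is_valid_utf8("\xC3\x28")); // bool(false)
```

If the check passes but the validator still complains, the data is in the first category (valid encoding, illegal character), which would also explain why utf8_encode() and utf8_decode() make no difference.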

EDIT: In response to the comment above about the illegal character being number 30: decimal 30 is U+001E, which is indeed an illegal character in XML and thus XHTML. If you intend them to be line breaks, then do a PHP regex substitution to replace \x1E with \n.
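A minimal sketch of that substitution follows. The function name `replace_record_separators` and its `$replacement` parameter are made up for illustration, and `$text` stands in for whatever string comes back from the database.

```php
<?php
// Hypothetical helper: swaps every U+001E (decimal 30) in $text for
// $replacement, which defaults to a newline.
function replace_record_separators(string $text, string $replacement = "\n"): string
{
    // In a single-quoted PHP string, \x1E reaches PCRE unchanged and
    // PCRE reads it as the character with code 0x1E.
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/\x1E/', $replacement, $text) ?? $text;
}

// Turn the stray separator into a real line break.
$asNewline = replace_record_separators("one\x1Etwo");

// If a newline is already stored next to the separator, pass an empty
// replacement to strip the separator instead.
$stripped = replace_record_separators("one\x1E\ntwo", '');
```

Both calls above produce the same string, "one", a newline, then "two". Whether to replace or strip depends on whether the data already carries its own newline alongside the separator.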

Ray Toal
  • Cool beans man! Thanks, that's a great explanation and the solution works. A well-earned 25 rep to you sir. – Ben Jul 12 '11 at 03:22
  • For any other HeidiSQL users out there, the specific line of code that worked for me was `preg_replace('/\x1E/','',$str)`, since the newline character is sent also. – Ben Jul 12 '11 at 03:29