0

ASCII control characters are finding their way in to databases I maintain at a semi-frequent basis. While I plan to implement the means to prevent them from being INSERTed in to SQL I have to first deal with characters already present in the database first.

For starters I know that it's possible to do a find and replace using SQL...

UPDATE [table_name] 
SET [field_name]=REPLACE([field_name],'[string_to_find]','[string_to_replace]');

PHP allows you to use an array to replace multiple items with something else...

$result = str_ireplace(array('1','2','3','4','5'),'0',$my_string);

So instead of lazy SQL queries in loops is there a way we can contruct an SQL query to replace multiple ASCII control characters?

Also very important is how we replace those control characters (e.g. as dec, oct, hex)?

Also I've read that carriage return is part of the control characters 0~31 so obviously we don't want to remove those to make sure line-breaks are retained unless I'm missing something on that?

Here is a reference page I'm going by right now...

http://www.ascii-code.com/


I code XHTML as application/xhtml+xml and tested the following as example1.xhtml and open EXPLICITLY in Firefox with NO BYTE ORDER MARK (BOM)...

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
</head>

<body>

<div>
<p>&#000;&#001;&#002;&#003;&#004;&#005;&#006;&#007;&#008;&#009;
&#010;&#011;&#012;&#013;&#014;&#015;&#016;&#017;&#018;&#019;
&#020;&#021;&#022;&#023;&#024;&#025;&#026;&#027;&#028;&#029;
&#030;&#031;</p>
</div>

</body>
</html>

The following are the only characters that can be used without creating malformed XML, save as example2.xhtml and open EXPLICITLY in Firefox with NO BYTE ORDER MARK (BOM)...

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
</head>

<body>

<div>
<p>&#009;&#010;&#013;</p>
</div>

</body>
</html>

Those working characters that we do not want to strip in SQL are...

  • 09 - Horizontal Tab

  • 10 - Line Feed

  • 13 - Carriage Return

halfer
  • 19,824
  • 17
  • 99
  • 186
John
  • 1
  • 13
  • 98
  • 177
  • So you only want to remove *some* control characters, but not others? Which are the characters that you *do* want to remove (and why)? MySQL does not have any native method of performing operations of this sort, although there are UDFs out there (such as the `PREG_REPLACE()` function provided by the [`lib_mysqludf_preg`](https://github.com/mysqludf/lib_mysqludf_preg) library) that you can use to do the trick. – eggyal Feb 21 '14 at 19:32
  • @eggyal I've done a bit of testing and have updated which characters need to be replaced to keep XML from being malformed when it is received from SQL. – John Feb 21 '14 at 19:52
  • If it's a case of not rendering invalid XML, you should probably pass the database results through an escaping function before outputting as XML. Otherwise the XML might also be broken by non-control (i.e. "printing" characters) such as `<` or `&`. – eggyal Feb 21 '14 at 20:20
  • @eggyal Using `htmlspecialchars()` is effective to an extent. While I'm open to to utilizing other functions or a combination instead the goal is to simply clean out the entire table. – John Feb 21 '14 at 20:28
  • The "control" characters that you have identified as being invalid are indeed not supported by XML 1.0, but they *are* valid in XML 1.1. Out of interest, why are you using XHTML 1.1 in XML 1.0? – eggyal Feb 21 '14 at 23:50
  • To obtain an XML 1.0 compliant string, you can specify appropriate flags to the escaping function - e.g. `htmlentities($str, ENT_XML1 | ENT_DISALLOWED)`. – eggyal Feb 21 '14 at 23:55

1 Answers1

0

This works for starters though does not necessarily do it in mass...

UPDATE a_log SET useragent=replace(useragent,char(4),'');
UPDATE a_log SET useragent=replace(useragent,char(5),'');

What I did was first copy the XML parsing error from Firefox. The Windows 7 clipboard does properly retain the ASCII control characters while doing a copy and paste. Then I pasted the offending text in to the form at this website...

http://www.asciivalue.com/

Finding the line I used the DEC and OCT (their column values matched) to reference the characters on this site...

http://www.ascii-code.com/

...and after a little research I was able to throw together a couple lines to clean out the offending characters that ended up malforming my XML.

I will still accept a better answer as the point is to to more of a dragnet styled fix. Additionally I actively admit this is unfortunately more of a reactionary stance to the problem to which I still must create, test and implement measures to prevent ASCII control characters from being inserted in to MySQL to begin with; I am therefore still looking for a well-written way to prevent this problem.

Should someone ask the same question and come across it they have some effective means of resolving the problem in the very least.

halfer
  • 19,824
  • 17
  • 99
  • 186
John
  • 1
  • 13
  • 98
  • 177