I have some PHP code that I use for text filtering. During filtering, some ASCII characters such as ampersand (&) and tilde (~) are temporarily converted to ASCII control characters (such as decimal code points 4 and 5). Just before the final filtered output is generated, the conversion is reverted.

$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
// ... some filtering code to work with $temp ...
$out = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);

This works well with input text in character encodings that use 8-bit code units, such as UTF-8 and ISO 8859-1. But I am not sure about input encoded in larger code units, such as UTF-16 or UTF-32. Will the first conversion step break the well-formedness of the input text? Will there be a conflict during the reversion step because of bytes that already exist in the input? The PHP setup does not overload multi-byte string functions.

Can anyone comment? Thanks.

user594694

1 Answer


str_replace works fine, as long as all strings passed to it are in the same encoding. It just does a byte-wise compare and replace of the data, so the actual encoding doesn't really matter.

That's why there is no mb_str_replace among PHP's multibyte string functions.
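
To illustrate the byte-wise behavior, here is a minimal sketch of the question's round trip applied to UTF-16LE input, with the filtering step left out. The sample input is an assumption chosen for illustration: the Cyrillic letter с (U+0441), whose UTF-16LE bytes are 0x41 0x04. Because the needles are single bytes, the revert step also rewrites the 0x04 byte that belongs to that letter.

```php
<?php
// Assumed sample input: Cyrillic "с" (U+0441) encoded as UTF-16LE.
// Its bytes are 0x41 0x04, so it already contains the placeholder byte 0x04.
$input = "\x41\x04";

// Single-byte needles, exactly as in the question.
$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
// ... filtering would happen here ...
$out  = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);

// The pre-existing 0x04 byte was "reverted" to 0x26 ('&'),
// so the output is no longer the same character.
echo bin2hex($input), "\n"; // 4104
echo bin2hex($out), "\n";   // 4126 (U+2641 when read as UTF-16LE)
```

So with single-byte needles the output can stay well-formed UTF-16 and still contain different characters than the input.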

GolezTrol
  • By 'all strings' do you mean that the '&' and '~' in the last line of the example code I provide should be UTF-16-encoded if the input text is in UTF-16? That is, should the PHP code itself (the PHP file) be in UTF-16? – user594694 Sep 15 '12 at 08:49
  • Preferably, yes. Otherwise the '&' could accidentally match a part of a UTF-16 character in the input string. I would recommend, though, not using UTF-16 at all. UTF-8 is the de facto standard online, and UTF-16 has few advantages. UTF-8 is good for size, UTF-32 for simplicity, and UTF-16 isn't good for either in most cases. – GolezTrol Sep 15 '12 at 15:26
  • Hmmm. The encoding of the input text is not in my control (and I want to avoid converting it to UTF-8). Thanks. – user594694 Sep 16 '12 at 01:11
  • "I want to avoid converting it to UTF-8" Why? You will have to have your output in a given encoding too, right? I think the best way to work is to have a single encoding (preferably UTF-8) for all your data anyway. Mixing encodings is asking for trouble. In 'the old days' mixing ANSI code pages was trouble (and still is for many), but now you're introducing a whole new level of mess by mixing Unicode encodings. Mind that UTF-16 also introduces problems with endianness between Windows and Linux. That's another reason to only use UTF-8. – GolezTrol Sep 16 '12 at 09:12
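
The suggestion in the comments to pass the needles in the same encoding as the input can be sketched as follows, assuming the input is known to be UTF-16LE. The helper `ascii_to_utf16le()` and the sample input are hypothetical and only handle ASCII; where the mbstring extension is available, `mb_convert_encoding('&', 'UTF-16LE', 'ASCII')` should produce the same bytes.

```php
<?php
// Hypothetical helper: encode an ASCII-only string as UTF-16LE
// by appending a zero high byte to every character.
function ascii_to_utf16le($ascii) {
    $out = '';
    for ($i = 0; $i < strlen($ascii); $i++) {
        $out .= $ascii[$i] . "\x00";
    }
    return $out;
}

// Needles and placeholders as two-byte UTF-16LE code units.
$plain  = array(ascii_to_utf16le('&'), ascii_to_utf16le('~'));
$masked = array(ascii_to_utf16le("\x04"), ascii_to_utf16le("\x05"));

// Assumed sample input: "a&с~" in UTF-16LE ("с" is U+0441, bytes 0x41 0x04).
$input = "a\x00&\x00\x41\x04~\x00";

$temp = str_replace($plain, $masked, $input);
// ... filtering would happen here ...
$out  = str_replace($masked, $plain, $temp);

echo bin2hex($temp), "\n";                    // 6100040041040500
echo var_export($out === $input, true), "\n"; // true
```

Note that str_replace still searches byte by byte, so a two-byte needle can match across the boundary between two code units (for example, bytes 0x41 0x04 followed by 0x00 0x01 contain 0x04 0x00 at an odd offset). The sketch does not guard against that case, nor against input that already contains U+0004 or U+0005.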