19

Recently ran into a very odd issue where my database contains strings with what appear to be normal whitespace characters but are in fact something else.

For instance, applying trim() to the string:

"TEST "

is getting me:

"TEST "

as a result. So I copy and paste the last character in the string and:

echo ord(' ');
194

194? According to ASCII tables that should be . So I'm just confused at this point. Why does this character appear to be whitespace and how can I trim() characters like this when trim() fails?

7 Answers7

26

It's more likely to be a two-byte 194 160 sequence, which is the UTF-8 encoding of a NO-BREAK SPACE codepoint (the equivalent of the   entity in HTML).

It's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't; nor will trim() remove it.

To replace NO-BREAK spaces with a normal space, you should be able to do something like:

$string = str_replace("\u{c2a0}", " ", $string);

or

$string = str_replace("\u{c2a0}", "", $string);

to remove them

Mark Baker
  • 209,507
  • 32
  • 346
  • 385
  • thanks for this answer as it provides more detail but unfortunately didn't work for me. for what it's worth I have lib_mysqludf_preg installed and was able to do this to fix my tables: UPDATE table SET cell = trim(preg_replace('/[[:space:]]+/',' ',cell)); – But those new buttons though.. Feb 23 '17 at 20:00
  • Thanks Mark for the clarification, but no, a regular expression \s does *not* match. See my answer for a working alternative. – Codesmith Jun 30 '21 at 00:30
  • Upvoted, your explanation is 100% correct, just do this to prove it: `$string = html_entity_decode(" "); print ORD($string[0]);` Output: `194`. ***But!***, [your solution is invalid.](https://3v4l.org/TNtk5) Codesmith's answer, however, does work, still upvoting for the explanation. – HoldOffHunger Jan 05 '22 at 15:54
  • wow this just solved a mystery in my database! I have some input having spaces but I cannot find the match, so turned out some users are entering char 194 as space. How could they do so?? From their laptop they can select UTF-8 or something? – Rosemary Jul 04 '23 at 06:25
14

You can try with :

PHP trim

$foo = "TEST ";
$foo = trim($foo);

PHP str_replace

$foo = "TEST ";
$foo = str_replace(chr(194), '', $foo);

IMPORTANT: You can try with chr(194).chr(160) or '\u00A0'

PHP preg_replace

$foo = "TEST ";
$foo = preg_replace('#(^\s+|\s+$)#', '', $foo);

OR (i'm not sure if it will work well)

$foo = "TEST ";
$foo = preg_replace('#[\xC2\xA0]#', '', $foo);
Olaf Erlandsen
  • 5,817
  • 9
  • 41
  • 73
5

Had the same issue. Solved it with

trim($str, ' ' . chr(194) . chr(160))
Alex
  • 1,605
  • 11
  • 14
3

You probably got the original data from Excel/CSV.. I'm importing from such format to my mysql db and it took me hours to figure out why it came padded and trim didn't appear to work (had to check every character in each CSV column string) but in fact it seems Excel adds chr(32) + chr (194) + chr(160) to "fill" the column, which at first sight, looks like all spaces at the end. This is what worked for me to have a pretty, perfect string to load into the db:

  // convert to utf8
  $value = iconv("ISO-8859-15", "UTF-8",$data[$c]);
  // excel adds 194+160 to fill up!
  $value = rtrim($value,chr(32).chr(194).chr(160));
  // sanitize (escape etc)
  $value = $dbc->sanitize($value);
cdsaenz
  • 520
  • 1
  • 10
  • 15
2
php -r 'print_r(json_encode(" "));'
"\u00a0"
$string = str_replace("\u{00a0}", "", $string); //not \u{c2a0}
Awesome
  • 181
  • 1
  • 3
  • Good catch, was wondering why that didn't work when @Mark Baker posted it. You should add a bit more explanation though instead of just code only answer. – But those new buttons though.. May 09 '20 at 04:09
  • "194, 160" is decimal and "c2a0" is hex , they are essentially the same. we need unicode code point after "\u" and "c2a0" is not unicode code point. we can get unicode code point using json_encode and we get "\u00a0". you can get more infomation here https://www.utf8-chartable.de/ – Awesome May 15 '20 at 08:26
2

I needed to trim my string in PHP and was getting the same results.

After discovering the reason via Mark Bakers answer, I used the following in place of trim:

// $str = trim($str); // won't strip UTF-8 encoded nonbreaking spaces
$str = preg_replace('/^(\\s|\\xC2\\xA0)+|(\\s|\\xC2\\xA0)+$/', '', $str);
Codesmith
  • 5,779
  • 5
  • 38
  • 50
0

Thought I should contribute an answer of my own since it has now become clear to me what was happening. The problem originates dealing with html which contains a non-breaking space entity,  . Once you load the content in php's DOMDocument(), all entities are converted to their decoded values and upon parsing the it you end up with a non-breaking space character. In any event, even in a different scenario, the following method is another option for converting these to regular spaces:

$foo = str_replace(' ',' ',htmlentities($foo));

This works by first converting the non-breaking space into it's html entity, and then to a regular space. The contents of $foo can now be easily trimmed as normal.