
I have strings in PHP which I read from a database. The strings are URLs and at first glance they look good, but there seems to be some weird character at the end. In the address bar of the browser, the string '%E2%80%8E' gets appended to the URL, which breaks the URL.

I found this post on stripping the left-to-right-mark from a string in PHP and it seems related to my problem, but the solution does not work for me because my characters seem to be something else.

So how can I determine which character I have so I can remove it from the strings?

(I would post one of the URLs here as an example, but the Stack Overflow form strips the trailing character as soon as I paste it in.)

I know that I could only allow certain chars in the string and discard all others. But I would still like to know what char it is -- and how it gets into the database.

EDIT: The question has been answered and the code given in the accepted answer works for me:

$str = preg_replace('/\p{C}+/u', "", $str);
spirit
  • I would use regular expressions to exclude them. See: http://www.roscripts.com/PHP_regular_expressions_examples-136.html – Anthony Horne Apr 17 '14 at 10:35
  • did you try the solution from user "YOU" ? – Casimir et Hippolyte Apr 17 '14 at 10:39
  • @CasimiretHippolyte Thanks. The preg_replace version given by the user YOU works for me, I just tried it. But which char is it? And why did the accepted solution not work if it was the right-to-left mark? – spirit Apr 17 '14 at 14:12

1 Answer


If the input is UTF-8 encoded, you can use a Unicode regex to match and strip invisible control characters such as e2808e (LEFT-TO-RIGHT MARK). Use the u (PCRE_UTF8) modifier together with \p{C} or \p{Other}.

Strip out all invisibles:

$str = preg_replace('/\p{C}+/u', "", $str);

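\p{C} is the Unicode general category "Other", which covers Cc (control), Cf (format), Cn (unassigned), Co (private use) and Cs (surrogate).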
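
Because the Cc subcategory includes tab, line feed and carriage return, the pattern above removes those too. That is harmless for URLs but may not be wanted for multi-line text. A minimal sketch of a narrower variant follows, assuming you want to keep those three whitespace characters; the helper name strip_invisibles() and the example.com URL are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: strip \p{C} characters but keep tab, LF and CR.
function strip_invisibles(string $str): string
{
    // [^\P{C}\t\n\r] reads as "in category C, and not tab, LF or CR".
    $clean = preg_replace('/[^\P{C}\t\n\r]+/u', '', $str);

    // preg_replace() returns null on error (e.g. invalid UTF-8 with /u).
    // Falling back to the input is a choice made for this sketch only.
    return $clean ?? $str;
}

// Made-up URL with a trailing LEFT-TO-RIGHT MARK (bytes E2 80 8E).
$url = "http://example.com/page\xE2\x80\x8E";

var_dump(strip_invisibles($url) === 'http://example.com/page'); // bool(true)
var_dump(strip_invisibles("a\tb\n") === "a\tb\n");              // bool(true)
```

The double negation inside the character class is what lets a single pattern say "category C except these characters"; more exceptions can be added to the class in the same way.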


Detect/identify invisibles:

$str = ".\xE2\x80\x8E.\xE2\x80\x8B.\xE2\x80\x8F";

// Collect every invisible character together with its byte offset
if (preg_match_all('/\p{C}/u', $str, $out, PREG_OFFSET_CAPTURE)) {
  echo "<pre>\n";
  foreach ($out[0] as $v) {
    // $v[0] is the matched character, $v[1] its byte offset in $str
    echo "detected ".bin2hex($v[0])." @ offset ".$v[1]."\n";
  }
  echo "</pre>";
}

outputs:

detected e2808e @ offset 1
detected e2808b @ offset 5
detected e2808f @ offset 9


To identify a character, search for its hex bytes, e.g. restricted to fileformat.info:

@google: site:fileformat.info e2808e
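
Character reference sites are usually indexed by code point (U+XXXX) rather than by raw UTF-8 bytes, so it can help to print the code point directly. The following is a rough sketch, not part of the original answer: utf8_codepoint() and list_invisibles() are hypothetical helper names, and the decoder assumes its input is a single, valid UTF-8 character (which is what a /u match yields).

```php
<?php
// Hypothetical helper: decode one valid UTF-8 character to its code point.
function utf8_codepoint(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0]; // single byte: plain ASCII
    }
    // Lead byte payload mask: 2 bytes -> 0x1F, 3 -> 0x0F, 4 -> 0x07
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    for ($i = 1; $i < $n; $i++) {
        // Each continuation byte contributes its low 6 bits
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

// Hypothetical helper: report each \p{C} character as "U+XXXX @ byte N".
function list_invisibles(string $str): array
{
    $found = [];
    if (preg_match_all('/\p{C}/u', $str, $out, PREG_OFFSET_CAPTURE)) {
        foreach ($out[0] as [$char, $offset]) {
            $found[] = sprintf('U+%04X @ byte %d', utf8_codepoint($char), $offset);
        }
    }
    return $found;
}

print_r(list_invisibles(".\xE2\x80\x8E.\xE2\x80\x8B.\xE2\x80\x8F"));
// U+200E @ byte 1, U+200B @ byte 5, U+200F @ byte 9

// The sequence seen in the address bar decodes to the same bytes:
print_r(list_invisibles(rawurldecode('%E2%80%8E')));
// U+200E @ byte 0
```

If the mbstring extension is loaded, mb_ord() (PHP 7.2+) could replace the hand-written decoder.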

Jonny 5
  • Thanks a lot, this answers the question. The preg_replace works and the function given identified the character as e2808e, which -- according to the suggested google search term -- is indeed the Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E). I have accepted this answer. – spirit Apr 17 '14 at 14:24
  • @jonny 5, been looking for this for like 6 hours, tried all manner of regex and here is what I needed all along. Thank you greatly – Andrew Killen Sep 17 '19 at 06:14
  • This will also strip [soft hyphens](https://en.wikipedia.org/wiki/Soft_hyphen), but per context their main intention is to be used for output reasons. – AmigoJack Jan 13 '22 at 10:52