
I am having difficulty matching two text strings. One of them seems to contain some hidden characters.

I have a text string, "PR & Communications", stored in an SQL database. When it is pulled from there into $database_version, var_dump($database_version) reveals the string to have 19 bytes.

I have scraped (with permission) some text from a website into a variable, $web_version. Ostensibly the string is "PR & Communications", but it does not match the database version, i.e. if ($database_version == $web_version) is NOT true.

var_dump() reveals $web_version to have 23 bytes. trim() has no effect, nor does strip_tags(), but preg_replace('/[^\PC\s]/u', '', $web_version) removes something, because afterwards var_dump($web_version) reveals the string to comprise only 14 bytes. It has clearly removed something, possibly too much, as the string still does not match $database_version.

Any ideas how to:

  1. find out what has been removed
  2. strip out just enough to match $database_version?

PS I don't know how to view the variable in hexadecimal code
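
One way to see what such a pattern removes is to collect its matches instead of deleting them and print each as hex. The sketch below is only an illustration: `removed_chars()` is a hypothetical helper, and the sample string (a BOM plus a zero-width space) is made up, since the real scraped bytes are not shown in the question.

```php
<?php
// Hypothetical helper (not from the question): list what the pattern would remove.
function removed_chars(string $text): array
{
    // Same pattern as in the question, but the matches are collected, not replaced.
    preg_match_all('/[^\PC\s]/u', $text, $matches);
    // Hex-encode each match so invisible characters become readable.
    return array_map('bin2hex', $matches[0]);
}

// Made-up sample: a UTF-8 BOM in front and a zero-width space in the middle.
$web_version = "\xEF\xBB\xBFPR &\xE2\x80\x8B Communications";

print_r(removed_chars($web_version));   // hex of every character the pattern matches
echo bin2hex($web_version), PHP_EOL;    // the whole string as hexadecimal
```

`bin2hex()` on the full string also covers the PS: it shows the variable byte by byte in hexadecimal.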

peterh
  • When you compare with _if( $database_version == $web_version )_, are both variables strings? Try some typecasting and the trim method. – Drone Jan 27 '16 at 16:09
  • You can try using `utf8_decode($web_version)` - http://php.net/manual/en/function.utf8-decode.php. – Scott Jan 27 '16 at 16:27
  • Debugging: to see the string as hex bytes, use `var_dump($web_version, bin2hex($web_version), __FILE__.__LINE__);`. To see what the characters represent, see: [ASCII Table and Description](http://www.asciitable.com/) and [Complete Character List for UTF-8](http://www.fileformat.info/info/charset/UTF-8/list.htm) – Ryan Vincent Jan 27 '16 at 17:01
  • Thank you Ryan, your var_dump formula revealed that one value had the '&' as a plain ampersand and the other as the HTML entity `&amp;`, hence the two values did not match. This helped me solve the problem. – heroicadventures Feb 03 '16 at 13:17

1 Answer

$v = preg_replace('/\s+|[[:^print:]]/', '', $string);

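By default, trim() strips only " \t\n\r\0\x0B", and only from the start and end of the string (see docs), so use the snippet above to remove non-printable characters anywhere in the string. Note that it also removes all whitespace, including the spaces between words.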
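
A short sketch of what the pattern does in practice, using a made-up input. The second variant, which keeps interior spaces, is an assumption for comparison and not part of the answer. Both patterns work on bytes (no `/u` flag), so they would also strip legitimate multi-byte UTF-8 characters such as "é".

```php
<?php
// Made-up input: a UTF-8 BOM in front, a tab and a newline at the end.
$string = "\xEF\xBB\xBFPR & Communications\t\n";

// The pattern from the answer: removes all whitespace and every non-printable byte.
$stripped = preg_replace('/\s+|[[:^print:]]/', '', $string);

// A gentler variant (an assumption, not from the answer): drop only the
// non-printable bytes, then trim the ends, so the spaces between words survive.
$kept = trim(preg_replace('/[[:^print:]]/', '', $string));

var_dump($stripped); // "PR&Communications", 17 bytes
var_dump($kept);     // "PR & Communications", 19 bytes
```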

Miguel V.
Aleksey Ratnikov
  • This helped me resolve a slightly different issue. Perhaps you could clarify what non-printed characters are and what this regex actually does? – Thomas Clowes Apr 08 '16 at 16:40
  • `[[:print:]]` is the PCRE character class for printable characters (a named alias for a more complex set; more of them: http://php.net/manual/en/regexp.reference.character-classes.php). The `^` symbol inside the class name means negation, so `[[:^print:]]` matches any non-printable character, i.e. one that is not visible after page render (a BOM, for example). The other parts of the regex are simple: `\s` stands for any whitespace character (space, tab, newline, etc.), `+` means "repeat one or more times", and the pipe (`|`) means "or". – Aleksey Ratnikov Apr 08 '16 at 16:52
  • So, as a whole, it can be read as "find any run of whitespace or any non-printable character". – Aleksey Ratnikov Apr 08 '16 at 16:53