We are working on a project where we have to imitate the export output of an old legacy system.

These exports are text-based and encoded in WINDOWS-1252, and special characters should be written as decimal numeric character references, e.g. α should become &#945;.

I tried htmlspecialchars, htmlentities and mb_convert_encoding, unfortunately with no luck.

Currently I'm iterating over each character of the string and checking whether it is an ASCII character. If it is not valid ASCII, I transform it to its decimal representation using mb_ord; see my function:

private function transformString(string $str): string
{
    // Pure ASCII input needs no transformation.
    if (mb_check_encoding($str, 'ASCII')) {
        return $str;
    }

    // Split the UTF-8 string into individual characters.
    $characters = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $transformedString = '';
    foreach ($characters as $character) {
        // Replace each non-ASCII character with its decimal reference.
        if (!mb_check_encoding($character, 'ASCII')) {
            $character = sprintf('&#%s;', mb_ord($character));
        }
        $transformedString .= $character;
    }

    return $transformedString;
}

This solution seems to work, but I'm curious whether there is a cleaner way to do this transformation.

Thanks in advance!

Fabian
  • There's a problem in your task description. α is not part of Windows-1252, but it's in cp437. – daxim Sep 20 '19 at 06:44
  • > I'm curious if there is a cleaner way –– Voting to move to https://codereview.stackexchange.com – daxim Sep 20 '19 at 06:44
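
The point raised in the first comment can be checked with a short sketch. It assumes the mbstring extension is loaded and that its substitute character is still the default "?"; the variable names are only illustrative.

```php
<?php
// "\u{03B1}" is α as UTF-8. It has no slot in Windows-1252, so mbstring
// falls back to its substitute character ("?" unless reconfigured).
$alpha = mb_convert_encoding("\u{03B1}", 'Windows-1252', 'UTF-8');

// "\u{00E9}" is é, which does exist in Windows-1252 as the single byte 0xE9.
$eAcute = mb_convert_encoding("\u{00E9}", 'Windows-1252', 'UTF-8');

var_dump($alpha === '?');
var_dump($eAcute === "\xE9");
```

Once every non-ASCII character has been replaced by a numeric reference, the string is plain ASCII, and ASCII bytes are identical in Windows-1252, so no further conversion step should be needed for the export.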

1 Answer

This function uses preg_replace_callback() to replace every non-ASCII character with its decimal numeric character reference.

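function encodeNonAscii($string)
{
    // The u modifier makes the pattern UTF-8 aware, so each match is one
    // whole character outside the ASCII range.
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($match) {
            // mb_ord() returns the code point, e.g. 945 for α.
            return '&#' . mb_ord($match[0]) . ';';
        },
        $string
    );
}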

It is only a little shorter and faster than the loop in the question.
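
As a possible alternative, mbstring also ships mb_encode_numericentity(), which does this kind of replacement without a callback. The following is a minimal sketch, assuming UTF-8 input; the wrapper name encodeNonAsciiBuiltin is hypothetical and only used for illustration.

```php
<?php
// Hypothetical wrapper name; mb_encode_numericentity() itself is part of mbstring.
function encodeNonAsciiBuiltin(string $string): string
{
    // Convmap quadruple: [first code point, last code point, offset, mask].
    // 0x80..0x10FFFF covers everything outside ASCII; offset 0 and a mask
    // wide enough for 21-bit code points leave the value unchanged.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($string, $convmap, 'UTF-8');
}

echo encodeNonAsciiBuiltin("a \u{03B1} b"), PHP_EOL; // a &#945; b
```

Like the regex version, this also encodes characters that Windows-1252 could represent directly, such as é, so whether that matches the legacy export would need to be checked against real sample output.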

jspit