
I don't have enough control over the remote server to install extensions; PHP is 5.3.8. But I've noticed that it is possible to split a UTF-8 string with PCRE.

So for example: preg_split('@@u','bücher',-1,PREG_SPLIT_NO_EMPTY);

gives: Array ( [0] => b, [1] => ├╝, [2] => c, [3] => h, [4] => e, [5] => r )

or for the Chinese word 中国/中华 it gives: Array ( [0] => ńŞş, [1] => ňŤŻ, [2] => /, [3] => ńŞş, [4] => ňŹÄ )

(The results above come from a non-Unicode display.) Still, it is clear that a UTF-8 string can be split without international extensions, and then (I think) it should be possible to get the character codes and do calculations with them to create an ASCII URL.
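As a rough illustration of the "get character codes" step, here is a minimal sketch that needs only PCRE. The function names `utf8_char_to_codepoint` and `utf8_to_codepoints` are hypothetical, and the code assumes the input is well-formed UTF-8 (with the `u` modifier, `preg_split` returns `false` otherwise).

```php
<?php
// Hypothetical helper: decode ONE UTF-8 encoded character to its code point.
// Assumes $char is a single well-formed sequence, as produced by preg_split with /u.
function utf8_char_to_codepoint($char)
{
    $lead = ord($char[0]);
    $len  = strlen($char);

    if ($len === 1) {
        return $lead; // 0xxxxxxx: plain ASCII
    }

    // Payload bits of the lead byte:
    // 110xxxxx -> 0x1F, 1110xxxx -> 0x0F, 11110xxx -> 0x07
    $masks = array(2 => 0x1F, 3 => 0x0F, 4 => 0x07);
    $code  = $lead & $masks[$len];

    // Each continuation byte (10xxxxxx) contributes six more bits.
    for ($i = 1; $i < $len; $i++) {
        $code = ($code << 6) | (ord($char[$i]) & 0x3F);
    }

    return $code;
}

// Hypothetical helper: split a UTF-8 string with PCRE and decode every character.
function utf8_to_codepoints($string)
{
    $chars = preg_split('@@u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array(); // invalid UTF-8 input
    }
    return array_map('utf8_char_to_codepoint', $chars);
}

print_r(utf8_to_codepoints('bücher'));
```

For 'bücher' this should yield 98, 252, 99, 104, 101, 114; the resulting integers could then be fed into whatever ASCII mapping (Punycode, for instance) is needed.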

hakre
rsk82
  • 1
    I'm not really sure what the question is? Maybe posting what you are hoping to get out of the code from the sample input you provided. Also, this is a helpful list of links regarding PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8 – jedwards Nov 07 '11 at 11:47
  • Which *international extensions* are you referring to? Can you add a list of those to your question? And yes, it's possible to get character codes out of binary data. If you want to get the Unicode code point for a UTF-8 character, the RFC describes how this is done: http://tools.ietf.org/html/rfc3629 – hakre Nov 07 '11 at 11:51
  • 1
    What you want to do seems to work fine, doesn't it? What is the problem? – Pekka Nov 07 '11 at 11:52
  • ok, I found it -> http://www.phpclasses.org/package/1509-PHP-Convert-from-and-to-IDNA-Punycode-domain-names.html - it is possible and has been done. – rsk82 Nov 07 '11 at 12:11

1 Answer


The only thing you need to know is the bit patterns that signal two-, three- and four-byte code points:

Table from http://en.wikipedia.org/wiki/UTF-8

Bits  Last Code Point  Octet 1  Octet 2  Octet 3  Octet 4

 7    U+007F           0xxxxxxx    -/-      -/-      -/-
11    U+07FF           110xxxxx 10xxxxxx    -/-      -/-
16    U+FFFF           1110xxxx 10xxxxxx 10xxxxxx    -/-
21    U+10FFFF         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

I don't speak PHP, but I'm quite sure existing code can be found that uses the bit patterns shown above to scan a UTF-8 byte sequence without actually interpreting it.
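To make the idea concrete, here is a minimal PHP sketch of such a scan. The function names `utf8_sequence_length` and `utf8_split` are hypothetical. It only inspects lead bytes to find character boundaries, does not validate the continuation bytes, and assumes `strlen` and `substr` operate on bytes (i.e. they are not overloaded by mbstring).

```php
<?php
// Hypothetical helper: number of bytes in the sequence started by $byte,
// or 0 if $byte cannot start a sequence.
function utf8_sequence_length($byte)
{
    if (($byte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

// Hypothetical helper: split a UTF-8 string into characters by scanning lead bytes.
// Returns false when a lead byte is invalid or the input is truncated.
function utf8_split($string)
{
    $chars = array();
    $total = strlen($string);
    $i     = 0;

    while ($i < $total) {
        $len = utf8_sequence_length(ord($string[$i]));
        if ($len === 0 || $i + $len > $total) {
            return false;
        }
        $chars[] = substr($string, $i, $len);
        $i += $len;
    }

    return $chars;
}
```

Calling `utf8_split('中国/中华')` should give five elements, matching what the PCRE split in the question produces.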

hakre
sehe
  • The easiest way in PHP would be to use mb_convert_encoding to convert to UTF-32 big endian and then, for each 4-byte chunk `$c`, `$codepoint = ord($c[0]) << 24 | ord($c[1]) << 16 | ord($c[2]) << 8 | ord($c[3])`. – Artefacto Nov 07 '11 at 11:55
  • 1
    And by the way, UTF-8 is limited to 21 bits nowadays. – Artefacto Nov 07 '11 at 11:57