Is it possible to write my own Punycode converter in PHP without the intl extension?

Question

I don't have enough control over the remote server to install extensions, and PHP is 5.3.8. But I've noticed that it is possible to split a UTF-8 string with PCRE.

So for example: preg_split('@@u','bücher',-1,PREG_SPLIT_NO_EMPTY);

gives: Array ( [0] => b, [1] => ├╝, [2] => c, [3] => h, [4] => e, [5] => r )

Or, for the Chinese string 中国/中华, it gives: Array ( [0] => ńŞş, [1] => ňŤŻ, [2] => /, [3] => ńŞş, [4] => ňŹÄ )

(The results above were printed on a non-Unicode display, hence the garbled characters.) Still, it is clear that a UTF-8 string can be split without internationalization extensions, so (I think) it should then be possible to get the character codes and do calculations with them to create an ASCII URL.
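The "calculations" on character codes can be sketched with the Punycode algorithm from RFC 3492. What follows is a minimal, unofficial sketch that encodes one label, assuming valid UTF-8 input. The function names (`utf8_char_to_codepoint`, `punycode_adapt`, `punycode_digit`, `punycode_encode_label`) are made up for illustration. It does no Nameprep, no case folding, no overflow checks, and does not add the `xn--` prefix that IDNA puts in front of an encoded label.

```php
<?php
// Hypothetical helper: code point of one UTF-8 character (1 to 4 bytes).
function utf8_char_to_codepoint($ch)
{
    $b = array_values(unpack('C*', $ch));
    switch (count($b)) {
        case 1:  return $b[0];
        case 2:  return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:  return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default: return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                      | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Bias adaptation (RFC 3492, section 6.1): base 36, tmin 1, tmax 26,
// skew 38, damp 700.
function punycode_adapt($delta, $numPoints, $firstTime)
{
    $delta = $firstTime ? (int)($delta / 700) : (int)($delta / 2);
    $delta += (int)($delta / $numPoints);
    $k = 0;
    while ($delta > 455) {            // ((base - tmin) * tmax) / 2
        $delta = (int)($delta / 35);  // base - tmin
        $k += 36;
    }
    return $k + (int)((36 * $delta) / ($delta + 38));
}

// Digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'.
function punycode_digit($d)
{
    return chr($d < 26 ? $d + 97 : $d + 22);
}

// Encode a single label (no dots). Returns false if the split fails.
function punycode_encode_label($label)
{
    $chars = preg_split('//u', $label, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false;
    }
    $cps = array_map('utf8_char_to_codepoint', $chars);

    // Copy the basic (ASCII) code points first, then a delimiter.
    $out = '';
    foreach ($cps as $cp) {
        if ($cp < 128) {
            $out .= chr($cp);
        }
    }
    $basic = strlen($out);
    $handled = $basic;
    if ($basic > 0) {
        $out .= '-';
    }

    $n = 128;    // initial_n
    $delta = 0;
    $bias = 72;  // initial_bias
    $total = count($cps);
    while ($handled < $total) {
        // Smallest code point not handled yet.
        $m = PHP_INT_MAX;
        foreach ($cps as $cp) {
            if ($cp >= $n && $cp < $m) {
                $m = $cp;
            }
        }
        $delta += ($m - $n) * ($handled + 1);
        $n = $m;
        foreach ($cps as $cp) {
            if ($cp < $n) {
                $delta++;
            }
            if ($cp == $n) {
                // Emit delta as a generalized variable-length integer.
                $q = $delta;
                for ($k = 36; ; $k += 36) {
                    $t = $k <= $bias ? 1 : ($k >= $bias + 26 ? 26 : $k - $bias);
                    if ($q < $t) {
                        break;
                    }
                    $out .= punycode_digit($t + ($q - $t) % (36 - $t));
                    $q = (int)(($q - $t) / (36 - $t));
                }
                $out .= punycode_digit($q);
                $bias = punycode_adapt($delta, $handled + 1, $handled == $basic);
                $delta = 0;
                $handled++;
            }
        }
        $delta++;
        $n++;
    }
    return $out;
}

// "b\xC3\xBCcher" is the UTF-8 byte sequence for "bücher".
echo punycode_encode_label("b\xC3\xBCcher"), "\n"; // bcher-kva
```

For a full domain name, each dot-separated label would have to be encoded separately and prefixed with `xn--` when it contains non-ASCII characters.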

I'm not really sure what the question is. Maybe post what you are hoping to get out of the code from the sample input you provided. Also, this is a helpful list of links regarding PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8 – jedwards, Nov 07 '11 at 11:47
Which *international extensions* are you referring to? Can you add a list of those to your question? And yes, it's possible to get character codes out of binary data. If you want to get the Unicode value for a UTF-8 character, the RFC describes how this is done: http://tools.ietf.org/html/rfc3629 – hakre, Nov 07 '11 at 11:51
What you want to do seems to work fine, doesn't it? What is the problem? – Pekka, Nov 07 '11 at 11:52
OK, I found it: http://www.phpclasses.org/package/1509-PHP-Convert-from-and-to-IDNA-Punycode-domain-names.html. It is possible and has been done. – rsk82, Nov 07 '11 at 12:11

score 0 · Accepted Answer · edited Nov 07 '11 at 15:58


The only thing you need to know is the bit masks that signal two-, three-, and four-byte code points:

Table from http://en.wikipedia.org/wiki/UTF-8

Bits  Last Code Point  Octet 1   Octet 2   Octet 3   Octet 4
 7    U+007F           0xxxxxxx  -/-       -/-       -/-
11    U+07FF           110xxxxx  10xxxxxx  -/-       -/-
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx  -/-
21    U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

I don't speak PHP, but I'm quite sure existing code can be found that uses the bit masks shown above to scan a UTF-8 byte sequence without actually interpreting it.
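As a rough illustration of using those bit masks in PHP, here is a minimal sketch that walks a byte string, reads the lead byte to find out how many continuation bytes follow, and assembles each code point. The function name `utf8_to_codepoints` is hypothetical. The sketch rejects stray or missing continuation bytes but does not check for overlong encodings or surrogates. It uses `array()` rather than `[]` so that it also parses on PHP 5.3.

```php
<?php
// Decode a UTF-8 byte string into an array of Unicode code points.
// Returns false on malformed input (invalid lead byte, truncated
// sequence, or a continuation byte that does not match 10xxxxxx).
function utf8_to_codepoints($s)
{
    $out = array();
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $b = ord($s[$i]);
        if (($b & 0x80) === 0x00) {         // 0xxxxxxx: one byte
            $cp = $b;
            $extra = 0;
        } elseif (($b & 0xE0) === 0xC0) {   // 110xxxxx: two bytes
            $cp = $b & 0x1F;
            $extra = 1;
        } elseif (($b & 0xF0) === 0xE0) {   // 1110xxxx: three bytes
            $cp = $b & 0x0F;
            $extra = 2;
        } elseif (($b & 0xF8) === 0xF0) {   // 11110xxx: four bytes
            $cp = $b & 0x07;
            $extra = 3;
        } else {
            return false;                   // continuation or invalid lead byte
        }
        if ($i + $extra >= $len) {
            return false;                   // sequence runs past the end
        }
        for ($j = 1; $j <= $extra; $j++) {
            $c = ord($s[$i + $j]);
            if (($c & 0xC0) !== 0x80) {
                return false;               // not a 10xxxxxx byte
            }
            $cp = ($cp << 6) | ($c & 0x3F); // append six payload bits
        }
        $out[] = $cp;
        $i += $extra + 1;
    }
    return $out;
}

// "b\xC3\xBCcher" is the UTF-8 byte sequence for "bücher".
print_r(utf8_to_codepoints("b\xC3\xBCcher")); // 98, 252, 99, 104, 101, 114
```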

edited Nov 07 '11 at 15:58

hakre


answered Nov 07 '11 at 11:51

sehe


The easiest way in PHP would be to use mb_convert_encoding to convert to UTF-32 big endian and then `$codepoint = ord($c[0]) << 24 | ord($c[1]) << 16 | ord($c[2]) << 8 | ord($c[3])`. – Artefacto Nov 07 '11 at 11:55
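A minimal sketch of that UTF-32 approach, assuming the mbstring extension is available on the server (it is separate from intl). Instead of combining four `ord()` calls by hand, it lets `unpack('N*', ...)` read each group of four bytes as an unsigned 32-bit big-endian integer. The function name `codepoints_via_utf32` is made up for illustration.

```php
<?php
// Convert a UTF-8 string to a list of Unicode code points by way of
// UTF-32BE, where every character occupies exactly four bytes.
// Assumption: mbstring is loaded; this is not guaranteed on every host.
function codepoints_via_utf32($utf8)
{
    $utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
    // 'N*' = repeat "unsigned long, 32 bit, big endian" until the data ends.
    // unpack() returns a 1-indexed array, so re-index it from 0.
    return array_values(unpack('N*', $utf32));
}
```

Calling `codepoints_via_utf32("b\xC3\xBCcher")` should give the code points of "bücher", with 252 (U+00FC) in second position.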

And by the way, UTF-8 is limited to 21 bits nowadays. – Artefacto Nov 07 '11 at 11:57
