
I have an N×2 mapping between two encodings (which ones is not essential to the question: Unicode and GB18030) in the format shown below. Warning: the XML file is huge, so don't open it on a slow connection: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

Snapshot:

<a u="00B7" b="A1 A4"/>
<a u="00B8" b="81 30 86 30"/>
<a u="00B9" b="81 30 86 31"/>
<a u="00BA" b="81 30 86 32"/>

I would like to store the b values (right column) in a data structure and access them directly (no searching) with indexes computed from the u values (left column).

Example:

I can store those elements in a data structure like this:

unsigned short *my_page[256] = {my_00, my_01, ....., my_ff};

where the elements are defined like:

static unsigned short my_00[256] etc.

So it is basically an array of 256 arrays of 256 elements each: 256x256 = 65536 available slots.

For other encodings with fewer elements and different values (e.g. Chinese Big5, Japanese Shift-JIS, Korean KSC, etc.), I can access the elements using a bijective function like this:

element = my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF]; where unicode[i] holds the u values from the mapping shown above. Generating and filling the my_page structure works analogously. For the encodings that work, I have around 7000 characters to store, and each one is stored in a unique slot in my_page.
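A minimal sketch of the two-level lookup described above, assuming 16-bit input values and 16-bit stored values. The names init_pages, lookup and my_empty are hypothetical and only serve the illustration, as is the convention that 0 means "no mapping"; the single sample entry is taken from the snapshot.

```cpp
#include <cassert>
#include <cstdint>

// Value 0 is used as the "no mapping" marker in this sketch.
static std::uint16_t my_00[256];     // page for U+0000..U+00FF
static std::uint16_t my_empty[256];  // shared page for high bytes without mappings

// Page directory, indexed by the high byte of the 16-bit value.
static std::uint16_t* my_page[256];

// Point every directory entry at a page, then store one sample mapping.
void init_pages() {
    for (auto& page : my_page) page = my_empty;
    my_page[0x00] = my_00;
    my_00[0xB7] = 0xA1A4;  // snapshot entry: u="00B7" b="A1 A4"
}

// Direct lookup, no searching: high byte picks the page, low byte the slot.
std::uint16_t lookup(std::uint16_t u) {
    assert(my_page[(u >> 8) & 0xFF] != nullptr);  // init_pages() must run first
    return my_page[(u >> 8) & 0xFF][u & 0xFF];
}
```

As long as the input really is a 16-bit value, the (high byte, low byte) pair is different for every input, so each value gets its own slot among the 65536 available ones.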

The problem comes with the GB18030 encoding, where I am trying to store 30861 elements in my_page (65536 slots). I use the same bijective function for filling (and then, analogously, accessing) the my_page structure, but it fails because the index computation does not return unique results.

For example, more than one Unicode value ends up being accessed via my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF], since the computed indexes can be the same for i and for i+1. Do you know another way of filling and accessing the elements of the my_page structure based only on pre-computed indexes, like I was trying to do?

I assume I have to use something like a pseudo-hash function that returns a range of values VRange, from which I can extract the integer indexes into my_page[256][256] using a set of rules.

If you have any advice, please let me know :)

Thank you !

Alex
  • GB18030 has 4 byte encodings: trying to store it in a 2 byte lookup table is wrong. I guess you are mapping from "unicode" to "GB"? UTF16 and UTF8 both contain 4 byte elements. UCS2 is the two byte variant, but nobody should be using it (heck, windows 2000 upgraded to UTF16). I suspect you have a fundamental misunderstanding of the encodings involved, as reflected by your datastructure. Unicode is not an encoding, it is (at best) a family of encodings. – Yakk - Adam Nevraumont Mar 25 '15 at 11:25
  • Yes, I have to use a 4 bytes lookup table. I have a mapping between UTF16 and GB18030. And I want to fill the GB18030 elements in a structure and to access them directly via the indexes based on their UTF16-correspondent. So I need: gbElem = my_page[ f(utf16Elem) ][ g(utf16Elem) ], where f and g are functions that have to return integers. – Alex Mar 25 '15 at 13:01
  • There are more than 2^16 elements in the [UCS](http://en.wikipedia.org/wiki/Universal_Character_Set), so two 8 bit indexes cannot represent every value in the UCS. [GB 18030](http://en.wikipedia.org/wiki/GB_18030) can encode *every* unicode code point, so your mapping does not work in either direction. It will work for the BMP (the basic multilingual plane), but that isn't enough. Your design would work fine to map UCS-2 (ancient 16 bit unicode single-wchar_t encoding, as opposed to UTF-16 which can be multi-character) to GB_18030, but GB_18030 is *bigger* than UCS-2. – Yakk - Adam Nevraumont Mar 25 '15 at 13:50
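The point made in the comments can be sketched as follows: once a code point lies outside the BMP, the two 8-bit indexes no longer identify it uniquely. The names Index, page_index and from_surrogates are hypothetical; the surrogate arithmetic is the standard UTF-16 decoding rule.

```cpp
#include <cassert>
#include <cstdint>

// Index pair produced by the formula from the question.
struct Index {
    std::uint32_t hi;
    std::uint32_t lo;
};

// Only bits 0..15 of the code point survive; higher bits are discarded.
Index page_index(std::uint32_t cp) {
    return Index{(cp >> 8) & 0xFF, cp & 0xFF};
}

// Combine a UTF-16 surrogate pair into the code point it encodes.
std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);  // must be a high surrogate
    assert(low >= 0xDC00 && low <= 0xDFFF);    // must be a low surrogate
    return 0x10000u + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
                    + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

For instance, U+0041 and U+10041 both produce the index pair (0x00, 0x41), so a 256x256 table cannot keep them apart.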

1 Answer


For GB18030, refer to this document: http://icu-project.org/docs/papers/gb18030.html

As explained in this article: “The number of valid byte sequences -- of Unicode code points covered and of mappings defined between them -- makes it impractical to directly use a normal, purely mapping-table-based codepage converter. With about 1.1 million mappings, a simple mapping table would be several megabytes in size.” So it is most likely not a good idea to implement the conversion with a pure mapping table. For large parts, there is a direct mapping between GB18030 and Unicode, and most of the four-byte characters can be translated algorithmically. The author of the article suggests handling such ranges with special code and the remaining characters with a classic mapping table. Those remaining characters are the ones given in the XML mapping table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
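As an illustration of the algorithmic part, here is a minimal sketch for one such range, the supplementary code points U+10000..U+10FFFF. It assumes the GB18030 layout in which these map linearly onto the four-byte sequences 90 30 81 30 .. E3 32 9A 35, with byte ranges 81..FE, 30..39, 81..FE, 30..39. The function name supplementary_to_gb18030 is hypothetical, and the BMP ranges would still need their own range list plus the XML table.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Linear position of the four-byte sequence 90 30 81 30:
// (0x90 - 0x81) * 10 * 126 * 10 = 189000.
constexpr std::uint32_t kSupplementaryBase = 189000;

// Convert a supplementary code point to its four GB18030 bytes.
std::array<std::uint8_t, 4> supplementary_to_gb18030(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    std::uint32_t linear = cp - 0x10000 + kSupplementaryBase;
    std::array<std::uint8_t, 4> b{};
    // Peel off the digits of a mixed-radix number: 10, 126, 10, then the lead byte.
    b[3] = static_cast<std::uint8_t>(0x30 + linear % 10);
    linear /= 10;
    b[2] = static_cast<std::uint8_t>(0x81 + linear % 126);
    linear /= 126;
    b[1] = static_cast<std::uint8_t>(0x30 + linear % 10);
    linear /= 10;
    b[0] = static_cast<std::uint8_t>(0x81 + linear);
    return b;
}
```

With this in place, no table entries are needed for the roughly one million supplementary code points; only the irregular mappings have to be stored.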

Therefore, purely index-based access on a matrix-like structure in C++ remains an open problem for anyone who wants to research such bijective functions.

Alex