
I am trying to call a C interface from Python using the ctypes module. Below is the prototype of the C function:

void UTF_to_Wide_char( const char* source, unsigned short* buffer, int bufferSize)

UTF_to_Wide_char : converts a UTF-8 string into a UCS-2 string

source (input) : a NULL-terminated UTF-8 string

buffer (output) : pointer to a buffer that will hold the converted text

bufferSize (input) : the size of the buffer; the system will copy up to this size, including the NULL terminator.

The following is my Python function:

from ctypes import c_wchar_p, create_unicode_buffer, sizeof

def to_ucs2(py_unicode_string):
    len_str = len(py_unicode_string)
    local_str = py_unicode_string.encode('UTF-8')
    src = c_wchar_p(local_str)
    buff = create_unicode_buffer(len_str * 2)
    # shared_lib is my ctypes-loaded instance of the shared library.
    shared_lib.UTF8_to_Widechar(src, buff, sizeof(buff))
    return buff.value
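For context on why the interpreter build matters here: ctypes' `c_wchar` maps to the platform's `wchar_t`, so both the element size of a unicode buffer and the byte count reported by `sizeof()` depend on the build. A small standalone probe (independent of the library above) illustrates this:

```python
from ctypes import c_wchar, create_unicode_buffer, sizeof

buf = create_unicode_buffer(10)   # room for 10 wide characters
print(sizeof(c_wchar))            # typically 2 on Windows, 4 on most Unix builds
# sizeof() reports bytes, not characters, so the third argument passed to
# the C function above changes meaning depending on the wchar_t size:
print(sizeof(buf) == 10 * sizeof(c_wchar))  # → True
```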

Problem : The code snippet above works fine with Python compiled with UCS-4 (the --enable-unicode=ucs4 option) but behaves unexpectedly with Python compiled with UCS-2 ( --enable-unicode=ucs2 ). ( I verified the Unicode compilation option by referring to How to find out if Python is compiled with UCS-2 or UCS-4? )
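For reference, a quick way to check which build is running from inside the interpreter, without inspecting the build flags (note that Python 3.3+ uses flexible string storage per PEP 393 and always reports the wide value):

```python
import sys

# 0xFFFF   -> narrow (UCS-2/UTF-16) build
# 0x10FFFF -> wide (UCS-4) build; also any Python 3.3+
if sys.maxunicode == 0xFFFF:
    print('narrow (UCS-2) build')
else:
    print('wide (UCS-4) build')
```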

Unfortunately, in the production environment I am using Python compiled with UCS-2. Please comment on the following points.

  1. Although I am sure the issue stems from the Unicode build option, I have yet to nail down what is happening under the hood. I need help coming up with the required justification.
  2. Is it possible to overcome this issue without compiling Python with the --enable-unicode=ucs4 option?

( I am quite new to Unicode encoding, but I have basic know-how. )

  • Python 2.7.5 ( compiled with --enable-unicode=ucs4 ) and Python 2.6.4 ( --enable-unicode=ucs2 ). Sorry, I missed mentioning the Python versions. – user2586432 Jul 23 '15 at 03:44
  • I assume `py_unicode_string` is a `unicode` instance instead of `str`. In that case, a `c_wchar_p` passed to a C function is already UTF-16 encoded in a narrow build (FYI, UCS-2 is limited to the basic multilingual plane, i.e. codes less than 65536). In a wide build you can manually encode to `'utf-16'`, so are you certain you'd rather call `UTF8_to_Widechar`? – Eryk Sun Jul 23 '15 at 04:12
  • It's true that py_unicode_string is a unicode instance. (If it is not, I am forcing it by encoding it: local_str.) Are you suggesting here that I avoid c_wchar_p? Sorry, I am not sure I got the exact point you are trying to convey. I request you to elaborate a bit. – user2586432 Jul 23 '15 at 05:12
  • You cannot force an arbitrary `str` byte string to UTF-8 via `encode('utf-8')`; that's a source of confusion in Python 2 that was removed from Python 3. What happens is the string first gets decoded as 7-bit ASCII. Thus `'\x80'.encode('utf-8')` fails because valid ASCII is limited to the range 0x00-0x7F. – Eryk Sun Jul 23 '15 at 05:29
  • Ok. I am integrating a third-party C library, which requires 16-bit wide-character strings. ( In fact, UTF8_to_Widechar is a helper method supplied by the library. ) Hope this makes sense. Please comment. – user2586432 Jul 23 '15 at 05:52
  • Just use Python's native `encode` method to create UTF-16 encoded strings, and pass them as `c_char_p`. This way Python handles getting the buffer size correct (e.g. handling surrogate pairs for non-BMP characters). – Eryk Sun Jul 23 '15 at 21:16
  • Using Python's native methods worked; I will go with this solution. As you pointed out, using c_wchar_p was a mistake. However, even after switching to c_char_p, I am still seeing the issue mentioned in the question, so I am still curious to understand what is going on. – user2586432 Jul 25 '15 at 15:45
  • Give an example, please, including your expected result vs the actual result. – Eryk Sun Jul 25 '15 at 16:04
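The fix suggested in the comments — let Python's codec do the conversion and pass the result as plain bytes — could be sketched as follows. The call at the bottom is illustrative only; substitute the real function exported by the library:

```python
from ctypes import c_char_p

def to_ucs2_bytes(py_unicode_string):
    # UTF-16-LE, no BOM, plus an explicit two-byte NUL terminator.
    # The codec emits correct surrogate pairs for non-BMP characters,
    # so no manual length math against wchar_t sizes is needed.
    return py_unicode_string.encode('utf-16-le') + b'\x00\x00'

# Hypothetical call: pass as c_char_p, not c_wchar_p, so the bytes reach
# the C side untouched regardless of the interpreter's Unicode build.
# shared_lib.SomeFunctionTakingUCS2(c_char_p(to_ucs2_bytes(u'text')))
```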

0 Answers