Possible to use wide-character members in Python extension objects?

Question

It's simple to create a member for an object in a Python C extension with a base type of char *, using the T_STRING define in the PyMemberDef declaration.

Why does there not seem to be an equivalent for wchar_t *? And if there actually is one, what is it?

e.g.

struct object contains char *text

PyMemberDef array has {"text", T_STRING, offsetof(struct object, text), READONLY, "This is a normal character string."}

versus something like

struct object contains wchar_t *wtext

PyMemberDef array has {"wtext", T_WSTRING, offsetof(struct object, wtext), READONLY, "This is a wide character string"}

I understand that something like PyUnicode_AsUTF8String() and its related functions can be used to encode the data as UTF-8, store it in a plain char string, and decode it later. But doing it that way would require wrapping the generic getattr and setattr functions with ones that account for the encoded text. It's also not very useful when you want character arrays of fixed element size within a struct and don't want the number of characters that fit in them to vary.
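To make the fixed-size concern concrete, here is a minimal sketch in plain C (no Python headers) of why a UTF-8 char buffer holds a varying number of characters. The function names utf8_len and chars_that_fit are made up for illustration and are not part of any Python or C library API.

```c
#include <stddef.h>

/* Bytes needed to encode one Unicode code point in UTF-8.
   Returns 0 for values that are not valid scalar values
   (surrogates, or anything above U+10FFFF). */
size_t utf8_len(unsigned long cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* How many copies of code point cp fit in a char buffer of buf_size
   bytes when one byte is reserved for the terminating NUL.
   Hypothetical helper, only meant to show that the count varies. */
size_t chars_that_fit(size_t buf_size, unsigned long cp)
{
    size_t n = utf8_len(cp);
    if (n == 0 || buf_size == 0) return 0;
    return (buf_size - 1) / n;
}
```

With a 16-byte buffer, chars_that_fit(16, 'A') gives 15 while chars_that_fit(16, 0x20AC) (the euro sign) gives 5, which is the variation a fixed-width wchar_t array avoids for characters that fit in a single element.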

I don't know if this answers your question, but: depending on how Python is compiled, Py_UNICODE might be wchar_t. Python can either use 2 bytes per unicode character (i.e. wchar), or 4. So C code needs to use the PyUnicode_* functions to handle unicode strings without assuming what format they're stored in. — Thomas K, May 31 '11 at 20:48
@Thomas: `wchar_t` is either two or four bytes, depending on platform. — Dietrich Epp, Jun 01 '11 at 04:50

score 2 · Accepted Answer · answered Jun 01 '11 at 03:01

Using a wchar_t directly is not portable. Instead, Python defines the Py_UNICODE type as the storage unit for a Unicode character.

Depending on the platform, Py_UNICODE may be defined as wchar_t if available, or an unsigned short/integer/long, the width of which will vary depending on how Python is configured (UCS2 vs UCS4) and the architecture and C compiler used. You can find the relevant definitions in unicodeobject.h.
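To see which case applies on a given platform, the width of wchar_t can be checked directly. This is a rough sketch in plain C that does not touch the Python headers; wchar_bits and wchar_holds_any_code_point are illustrative names, not existing APIs.

```c
#include <limits.h>
#include <wchar.h>

/* Number of bits in one wchar_t on the platform this is compiled for
   (typically 16 on Windows and 32 on most Unix-like systems). */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Returns 1 if a single wchar_t can hold every Unicode code point
   (up to U+10FFFF), or 0 if characters outside the Basic Multilingual
   Plane would have to be split into surrogate pairs. */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

If wchar_holds_any_code_point() returns 0, code that assumes one wchar_t per character will mishandle anything outside the BMP, which is one reason to go through the PyUnicode_* functions rather than assume a storage width.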

For your use case, your object can have an attribute that is a Unicode string, using T_OBJECT:

static struct PyMemberDef attr_members[] = {
  { "wtext", T_OBJECT, offsetof(PyAttrObject, wtext), READONLY, "wide string" },
  ...

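You can perform type checking in the object's initializer (tp_init, which signals failure by returning -1):

...
if (!PyUnicode_CheckExact(arg)) {
    /* Wrong argument type, so raise TypeError */
    PyErr_SetString(PyExc_TypeError, "arg must be a unicode string");
    return -1;  /* return NULL instead if this check lives in tp_new */
}
Py_INCREF(arg);  /* the object keeps its own reference to the string */
self->wtext = arg;
...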

If you ever need to iterate over the low-level characters in the Unicode string, there is a macro which returns a Py_UNICODE * (this is the legacy Py_UNICODE API, deprecated since Python 3.3 and removed in 3.12):

Py_ssize_t i = 0;  /* same type as size, so the comparison below cannot overflow */
Py_ssize_t size = PyUnicode_GetSize(self->wtext);
Py_UNICODE *chars = PyUnicode_AS_UNICODE(self->wtext);
for (i = 0; i < size; i++) {
    // use chars[i]
    ...

I see. If I'm not mistaken, though, the Python reference seems to recommend the use of `T_OBJECT_EX` over `T_OBJECT` due to how certain cases are handled. — JAB, Jun 01 '11 at 13:18
Yep, you could use `T_OBJECT_EX` instead. For a `READONLY` attribute (which cannot be deleted) a `T_OBJECT` should also work fine. Choice also depends on whether you want a `NULL` value for `self->wtext` to raise an error or just return `None`, which really depends on the behavior you want your object to exhibit. — samplebias, Jun 01 '11 at 15:42
