I was looking into how Python represents strings after PEP 393, and I don't understand the difference between PyASCIIObject and PyCompactUnicodeObject.
My understanding is that strings are represented with the following structures:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;          /* Number of code points in the string */
        Py_hash_t hash;             /* Hash value; -1 if not set */
        struct {
            unsigned int interned:2;
            unsigned int kind:3;
            unsigned int compact:1;
            unsigned int ascii:1;
            unsigned int ready:1;
            unsigned int :24;
        } state;
        wchar_t *wstr;              /* wchar_t representation (null-terminated) */
    } PyASCIIObject;

    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;
        char *utf8;
        Py_ssize_t wstr_length;
    } PyCompactUnicodeObject;

    typedef struct {
        PyCompactUnicodeObject _base;
        union {
            void *any;
            Py_UCS1 *latin1;
            Py_UCS2 *ucs2;
            Py_UCS4 *ucs4;
        } data;
    } PyUnicodeObject;

Correct me if I am wrong, but my understanding is that PyASCIIObject is used for strings that contain only ASCII characters, PyCompactUnicodeObject embeds PyASCIIObject as its first member and is used for strings with at least one non-ASCII character, and PyUnicodeObject is used by the legacy functions. Is that correct?
Also, why does PyASCIIObject use wchar_t? Isn't char enough to represent ASCII strings? In addition, if PyASCIIObject already has a wchar_t pointer, why does PyCompactUnicodeObject also have a char pointer? My understanding is that both pointers point to the same location, so why include both?