0

Trying to learn a little Cython, I've been attempting to write a toy library that just holds a few cstrings (corresponding to the available choices for a factoral/categorical data type). The strings being pointed to within the class are being overwritten, and my C/Cython-foo is too minimal to figure out why.

The result is something like this:

>>> import coupla
>>> ff = coupla.CouplaStrings(["one", "two"])
>>> ff
write, two
>>> ff
, two
>>> ff
two, two

Help is greatly appreciated! I feel like I'm going crazy. Just using the to_cstring_array and to_str_list functions seems to work fine, but within the class it goes kaputt.

cdef extern from "Python.h":
    char* PyUnicode_AsUTF8(object unicode)

from libc.stdlib cimport malloc, free

cdef char **to_cstring_array(list_str):
    """Stolen from Stackoverflow:
    https://stackoverflow.com/questions/17511309/fast-string-array-cython/17511714#17511714
    """
    cdef Py_ssize_t num_strs = len(list_str)
    cdef char **ret = <char **>malloc(num_strs * sizeof(char *))

    for i in range(num_strs):
        ret[i] = PyUnicode_AsUTF8(list_str[i])

    return ret

cdef to_str_list(char **cstr_array, Py_ssize_t size):
    cdef int i
    result = []

    for i in range(size):
        result.append(bytes(cstr_array[i]).decode("utf-8"))

    return result

cdef class CouplaStrings:
    cdef char **_strings
    cdef Py_ssize_t _num_strings

    def __init__(self, strings):
        cdef Py_ssize_t num_strings = len(strings)
        cdef char **tstrings = <char **> to_cstring_array(strings)

        self._num_strings = num_strings
        self._strings = tstrings

    def __repr__(self):
        """Just for testing."""
        return ", ".join(to_str_list(self._strings, self._num_strings))

    def __dealloc__(self):
        free(self._strings)

Edit:

See the answer below by user2357112. An edited version of CouplaStrings seems to avoid that particular problem, though I wouldn't swear on its overall correctness.

Edit 2: THIS IS WRONG IGNORE ONLY KEPT FOR HISTORICAL PURPOSES

cdef class CouplaStrings:
    cdef char **_strings
    cdef Py_ssize_t _num_strings

    def __init__(self, strings):
        cdef Py_ssize_t num_strings = len(strings)

        cdef char **ret = <char **> PyMem_Malloc(num_strings * sizeof(char *))

        for i in range(num_strings):
            ret[i] = <char *> PyMem_Realloc(PyUnicode_AsUTF8(strings[i]),
                                            sizeof(char *))

        self._num_strings = num_strings
        self._strings = ret

    def __repr__(self):
        """Just for testing."""
        return ", ".join(to_str_list(self._strings, self._num_strings))

    def __dealloc__(self):
        PyMem_Free(self._strings)
James Cunningham
  • 775
  • 4
  • 13
  • The edited version is worse! You've reallocated memory that Python believes that it is managing. You want to work out the size, malloc a new but of memory, then copy. Or just avoid using C strings because they're always painful. – DavidW Mar 24 '19 at 23:06
  • Oh no! well, it's not surprising, I suppose. Back to the drawing board here. – James Cunningham Mar 24 '19 at 23:20

1 Answers1

1

You've failed to account for ownership and memory management.

The UTF-8 encoding returned by PyUnicode_AsUTF8 is owned by the string object PyUnicode_AsUTF8 was called on, and it is reclaimed when that string dies. To prevent the string object from dying before your object does, your object needs to keep a (Python) reference to the string object. Alternatively, you can copy the UTF-8 encodings into memory you allocate yourself, and take responsibility for freeing that memory yourself.

Otherwise, you'll just have an array of dangling pointers.

user2357112
  • 260,549
  • 28
  • 431
  • 505
  • Thanks, that gives me a new perspective here. I've written code that fixes the original problem, though I'm not sure it's correct. I'll edit the answer with that. – James Cunningham Mar 24 '19 at 22:52