I have a data structure in Cython that uses a char * member.

What is happening is that the member value seems to lose its scope outside of a function that assigns a value to the member. See this example (using IPython):

[nav] In [26]: %%cython -f 
      ...: ctypedef struct A: 
      ...:     char *s 
      ...:      
      ...: cdef char *_s 
      ...:      
      ...: cdef void fn(A *a, msg): 
      ...:     s = msg.encode() 
      ...:     a[0].s = s 
      ...:  
      ...: cdef A _a 
      ...: _a.s = _s 
      ...: fn(&_a, 'hello') 
      ...: print(_a.s)          
      ...: print(b'hola') 
      ...: print(_a.s)          
b'hello'
b'hola'
b"b'hola'"

It looks like _a.s is deallocated outside of fn and is being assigned any junk that is in memory that fits the slot.

This happens only under certain circumstances. For example, if I assign b'hello' to s instead of the encoded string inside fn(), the correct string is printed outside of the function.

As you can see, I also added an extra declaration for the char * variable and assigned it to the struct before calling fn, to make sure that the _a.s pointer does not go out of scope. However, my suspicion is that the problem comes from assigning the member from a variable that lives in the function scope.

What is really happening here, and how do I resolve this issue?

Thanks.

user3758232

1 Answer

Your problem is that the pointer a.s is left dangling almost as soon as it is created inside fn.

Calling msg.encode() creates a temporary bytes object s, and the address of its buffer is stored in a.s. However, directly afterwards (i.e. on exit from the function) the last reference to the temporary bytes object is dropped, the object is destroyed, and the pointer becomes dangling.

Because the bytes object was small, Python's memory manager keeps its memory in an arena, which is why accessing the address does not segfault here (lucky you).

Although the temporary object is destroyed, its memory isn't overwritten or sanitized, so from a.s's point of view it looks as if the temporary object were still alive.

Whenever you create a new bytes object similar in size to the temporary one, the old memory in the arena might get reused, so your pointer a.s can end up pointing to the buffer of the newly allocated bytes object. That is what happened with b'hola' in your example.

By the way, had you used a[0].s = msg.encode() directly (and I guess you did), Cython would have refused to build and told you that you are trying to store a reference to a temporary Python object. Adding an explicit variable fooled Cython, but didn't help your case.

So what to do about it? Which solution is appropriate depends on the bigger picture, but the available strategies are:

  1. Manage the memory of A.s yourself, i.e. manually reserve memory, copy from the temporary object, and free the memory as soon as you are done.
  2. Manage reference counting: add a PyObject * to the A struct. Assign the temporary object to it (don't forget to increase the reference counter manually) and decrease the reference counter as soon as you are done.
  3. Collect references to the temporary objects in a pool (e.g. a list), which keeps them alive. Clear the pool as soon as the objects aren't needed.
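As a rough illustration of option 1, here is a minimal sketch in plain Python using ctypes, so that it can be run outside a Cython cell. The names set_s and free_s are hypothetical helpers, not part of the question's code, and going through ctypes.pythonapi assumes CPython. In real Cython code the same three steps (allocate, copy, free) would be written with malloc/memcpy/free or PyMem_Malloc/PyMem_Free directly.

```python
import ctypes

# CPython's raw allocator, reached through ctypes (assumption: CPython).
PyMem_Malloc = ctypes.pythonapi.PyMem_Malloc
PyMem_Malloc.argtypes = [ctypes.c_size_t]
PyMem_Malloc.restype = ctypes.c_void_p
PyMem_Free = ctypes.pythonapi.PyMem_Free
PyMem_Free.argtypes = [ctypes.c_void_p]
PyMem_Free.restype = None


class A(ctypes.Structure):
    # Stand-in for the C struct. c_void_p is used instead of c_char_p
    # so that ctypes does not keep any Python object alive behind our back.
    _fields_ = [("s", ctypes.c_void_p)]


def free_s(a):
    """Release the buffer owned by a.s, if any (hypothetical helper)."""
    if a.s:
        PyMem_Free(a.s)
        a.s = None


def set_s(a, msg):
    """Copy the encoded message into memory owned by the struct (hypothetical helper)."""
    free_s(a)                        # avoid leaking a previous buffer
    data = msg.encode()              # temporary bytes object
    n = len(data)
    buf = PyMem_Malloc(n + 1)        # +1 for the terminating NUL
    if not buf:
        raise MemoryError
    ctypes.memmove(buf, data, n)     # copy out of the temporary
    ctypes.memset(buf + n, 0, 1)     # NUL-terminate
    a.s = buf                        # the struct now owns its own copy


a = A()
set_s(a, "hello")
other = b"hola"                      # new allocations no longer affect a.s
print(ctypes.string_at(a.s))
free_s(a)                            # the owner must free explicitly
```

The price of this approach is that every code path that overwrites or discards the struct has to call the matching free.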

Option 3 is not always the best, but it is the easiest: you have to manage neither the memory nor the reference counting:

%%cython
...
pool=[]   
cdef void fn(A *a, msg):    
    s = msg.encode()
    pool.append(s) 
    a[0].s = s

While this doesn't solve the underlying problem, using PyUnicode_AsUTF8 (inspired by this answer) might be a satisfactory solution in this case:

%%cython

# it is not wrapped by `cpython.pxd`:
cdef extern from "Python.h":
    const char* PyUnicode_AsUTF8(object unicode)
...

cdef void fn(A *a, msg): 
    a[0].s = PyUnicode_AsUTF8(msg)  # msg.encode() uses UTF-8 by default

This has at least two advantages:

  • the pointer a[0].s stays valid as long as msg is alive
  • calling PyUnicode_AsUTF8(msg) is faster than msg.encode(), because it reuses a buffer cached in the string object. It is therefore basically O(1) after the first call, while msg.encode() has to at least copy the memory and is O(n), where n is the number of characters.
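The caching behaviour can be observed from plain Python as well. The following is a small sketch that calls PyUnicode_AsUTF8 through ctypes.pythonapi (an assumption: this only works on CPython) and compares the addresses returned by two calls on the same string object.

```python
import ctypes

# Bind the C-API function; c_void_p lets us look at the raw address.
PyUnicode_AsUTF8 = ctypes.pythonapi.PyUnicode_AsUTF8
PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
PyUnicode_AsUTF8.restype = ctypes.c_void_p

msg = "h\u00e9llo"                  # non-ASCII, so a UTF-8 copy has to be built once

p1 = PyUnicode_AsUTF8(msg)          # first call creates and caches the buffer
p2 = PyUnicode_AsUTF8(msg)          # second call returns the cached buffer

print(p1 == p2)                     # same address both times
print(ctypes.string_at(p1) == msg.encode())   # same bytes as msg.encode()
```

The buffer belongs to msg, so the address is only meaningful while msg itself is still referenced somewhere.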
ead
  • I want to avoid using Python objects and function calls in this function (other than the necessary input `msg`) because it's a bottom-level one, so adding a list would very likely degrade performance. I may want to go with #1, which is a bit of a pain because my struct has 3 string members like that, but probably the most memory-efficient solution. I never used [cymem](https://github.com/explosion/cymem) but it seems like a good way to make memory management easier. – user3758232 Jan 08 '19 at 01:13
  • For the moment, I just moved the function code in the calling function, because nothing else is using that *for now*, but it's good to know my options for the future. – user3758232 Jan 08 '19 at 01:14
  • Would another solution be having the function return a new struct? Would that remain allocated? – user3758232 Jan 08 '19 at 01:20
  • @user3758232 Without knowing the big picture and using profiler it is hard to tell, what is the "best" solution. Maybe it is better to keep your data in bytes-objects (then creating of temporary objects is no longer needed) rather than unicode-objects, maybe one of the strategies above, maybe something else. – ead Jan 08 '19 at 05:12
  • @user3758232 I have updated answer with another solution, maybe it is what you are looking for – ead Jan 08 '19 at 09:08