1

Consider the following contrived Cython function to join a list of strings:

# cython: language_level=3
cpdef test_join():
    """ ["abc", "def", "ghi"] -> "abcdefghi" """
    cdef:
        list lines = ["abc", "def", "ghi"]
        char* out = ""
        char* line = ""
        int i
    for i in range(len(lines)):
        line = lines[i]
        out = out + line
    return out

It will fail to compile, with this error:

Storing unsafe C derivative of temporary Python reference

I am assuming this has to do with line being of type char* and continually re-assigned. I have seen an answer to a similar question, but haven't been able to modify that answer for this basic example. (And it also involves a good amount of C-API with which I'm not familiar.)

How can I modify the above function to compile and return as expected?


More broadly, I want to better understand this error. Commit 37e4a20 has a bit of explanation:

Taking a char* from a temporary Python string object ... A compile time error is raised only when such a pointer is assigned to a variable and would thus exceed the lifetime of the string itself.


Update: to simplify things a bit further, it looks like it's the assignment that causes the issue:

cpdef int will_succeed():
    cdef char* a = b"hello"
    cdef char* b = b" world"
    print(a + b)  # no new assignment
    return 1

cpdef will_fail():
    cdef char* a = b"hello"
    cdef char* b = b" world"
    a = a + b  # won't compile
    return a

I suspect there may be a more proper way do this this with something from string.pxd/string.h, but I'm pretty weak on C memory management and efficiency:

from libc.string cimport strcat, strcpy

cpdef use_strcat():
    cdef char out[1024]
    strcpy(out, b"")

    cdef char* a = b"hello"
    cdef char* b = b" world"

    strcat(out, a)
    strcat(out, b)
    return out
Brad Solomon
  • 38,521
  • 31
  • 149
  • 235

1 Answers1

3

I think the issue is with

out = out + line

Cython doesn't define an operator + for C strings. Instead it converts them to Python strings and concatenates those:

tmp1 = str(out)
tmp2 = str(line)
tmp3 = tmp1 + tmp2
out = get_c_string_from(tmp3)

out therefore becomes an invalid pointer as soon as tmp3 is destroyed (which is instantly).


I'd avoid using strcat because it's not very efficient for repeated uses. Instead keep track of the current string length and copy the data yourself. Given that you have an unknown length you probably want to allocate the string with malloc (in which case you're responsible for freeing it)

from libc.stdlib cimport free, malloc, realloc
from libc.string cimport memcpy

from cython import Py_ssize_t

cdef char         *line
cdef Py_ssize_t   i
cdef Py_ssize_t   length = 0
cdef Py_ssize_t   incrlength
cdef char         *out = <char *>malloc(1)  # Reallocate as needed

try:
    out[0] = b'\x00' # keep C-strings null-terminated
    for i in range(len(lines)):
        line = lines[i]
        incrlength = len(line)
        out = <char *>realloc(out, length + incrlength + 1)
        memcpy(out + length, line, incrlength)
        length += incrlength
        out[length] = '\x00'  # keep C-strings null-terminated
    return out  # autoconversion back to a Python string

finally:
   free(out)

This is a rough outline of what I think you should be doing, and isn't really tested.

Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
DavidW
  • 29,336
  • 6
  • 55
  • 86
  • I'm not sure if this is fully correct. It doesn't appear to be the concatenation that's the problem; it's the assignment. See: https://pastebin.com/RX2c423Q – Brad Solomon Dec 01 '18 at 20:30
  • 1
    @BradSolomon as far as I understand this answer, it doesn't say that the problem `+`, but rather that the result of '+' is a temporary python bytes-object and `char *out` references the data inside it, which gets destroyed because the temporary object gets out of scope (which is instantly) leaving `out` a dangling pointer. – ead Dec 01 '18 at 20:41
  • I think that something from `string.pxd` might be usable here: https://github.com/cython/cython/blob/master/Cython/Includes/libc/string.pxd; see updated question – Brad Solomon Dec 01 '18 at 21:10
  • I've edited in an outline of how I'd solve this... It's not guaranteed to work exactly, but I think it illustrates the approach I'd be taking – DavidW Dec 01 '18 at 23:44
  • Thank you @DavidW, after some reading I'll be staying away from `strcat()` within a loop. That said, I believe that `cdef char* out = malloc...` will not compile because the [signature](https://github.com/cython/cython/blob/master/Cython/Includes/libc/stdlib.pxd) is `void *malloc (size_t size)`. Appreciate the answer regardless – Brad Solomon Dec 02 '18 at 19:37
  • Ah - good point. Cython requires you to cast malloc: `cdef char* out = malloc(...)` (and the same for realloc) – DavidW Dec 02 '18 at 21:07
  • @DavidW I made some edits to your answer following your comments. One more question for you - would it be bad practice to `malloc(n)` where `n` is the sum of string lengths? It seems in some benchmarks that `realloc` is also fairly expensive. I'm wondering if there's a downside to doing all the allocation upfront, in terms of space or some other unseen bugginess – Brad Solomon Dec 04 '18 at 18:53
  • 1
    @BradSolomon No - if you know the string length at the start then it'd make sense to do the allocation upfront (or even if you know an upper bound and it isn't ridiculously oversized). In hindsight I'm now slightly unsure why I used realloc... – DavidW Dec 04 '18 at 19:20