
I have a simple Python script:

import _tph
str = u'Привет, <b>мир!</b>' # A Unicode string containing Russian characters
_tph.strip_tags(str)

and a C library that is compiled into _tph.so. This is its strip_tags function:

PyObject *strip_tags(PyObject *self, PyObject *args) {
    PyUnicodeObject *string;
    Py_ssize_t length;

    PyArg_ParseTuple(args, "u#", &string, &length);
    printf("%d, %d\n", string->length, length);

    // ...
}

The printf call prints this: 1080, 19. So the length of str really is 19 characters, but where on earth do those 1080 characters come from?

When I print the string, I get my str, a null character, and then a lot of junk bytes.

The junk memory looks like this:

u'\u041f\u0440\u0438\u0432\u0435\u0442, <b>\u043c\u0438\u0440!</b>\x00\x00\u0299\Ub7024000\U08c55800\Ub7025904\x00\Ub777351c\U08c79e58\x00\U08c7a0b4\x00\Ub7025904\Ub7025954\Ub702594c\Ub702591c\Ub702592c\Ub7025934\x00\x00\x00

How can I get a normal string here?

SvartalF
1 Answer


The "string" argument here isn't well named. It is declared as a pointer to a Python Unicode object, so your printf is seeing a lot of binary data (the object type, GC headers, the ref count, and the encoded Unicode code points) until it happens to find a zero byte, which printf interprets as the end of the string.

The simplest way to view the string is with PyObject_Print(), which takes the object, a FILE * to write to, and a flags argument. You can find the C functions for manipulating Python Unicode objects at: http://docs.python.org/c-api/unicode.html#unicode-objects

Raymond Hettinger
  • In fact, I get a segmentation fault with code like this: `PyObject_Print((PyObject *)string, stdout, 0);` And yes, I did try saving the thread state for the GIL. – SvartalF Oct 31 '11 at 15:14
  • "string" is declared as a PyUnicode object. To get that object, change the parsing code to "O" and use PyObject_Print() on the result. Alternatively, change the declaration to a Unicode buffer pointer and continue to use "u#". The latter gives you a pointer to a counted array (it is not null-terminated, so it cannot be handed to printf as a C string). – Raymond Hettinger Oct 31 '11 at 15:20
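
To illustrate what a "counted array" means in practice, here is a minimal sketch in plain C that does not depend on Python.h. The type code_unit_t is only a stand-in for Py_UNICODE (assumed here to be 32 bits wide, as on a UCS-4 build), and counted_to_utf8 is a hypothetical helper, not part of the Python C API. The point is that the loop is bounded by the length that "u#" hands back, never by a terminating zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Py_UNICODE (assumption: a 32-bit code unit). */
typedef uint32_t code_unit_t;

/* Hypothetical helper: encode exactly `len` code units from `buf` as UTF-8
 * into `out`, then append a '\0' so the result is printable with printf.
 * The loop is bounded by `len`; an embedded zero in `buf` does not stop it.
 * Returns the number of bytes written before the '\0', or 0 if `out` is too
 * small (or `len` is 0). Surrogates and out-of-range values are not checked. */
size_t counted_to_utf8(const code_unit_t *buf, size_t len, char *out, size_t cap)
{
    size_t n = 0;

    assert(buf != NULL && out != NULL);
    if (cap == 0)
        return 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t c = buf[i];
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;

        if (n + need + 1 > cap)      /* keep room for the trailing '\0' */
            return 0;

        if (need == 1) {
            out[n++] = (char)c;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (c >> 12));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | ((c >> 18) & 0x07));
            out[n++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

In an extension function, the buffer pointer and length filled in by "u#" would be passed as buf and len, and the resulting out could then be printed with printf("%s\n", out) on a UTF-8 terminal. Converting into a separately terminated buffer is just one option; the same bounded loop could write each character directly.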