Given a file /myfiles/file_with_invalid_encoding.txt that contains invalid UTF-8:

parse this correctly
Føö»BÃ¥r
also parse this correctly

I am calling Python's builtin open function through the C API. Here is a minimal example (excluding the CPython setup boilerplate):

const char* filepath = "/myfiles/file_with_invalid_encoding.txt";
PyObject* iomodule = PyImport_ImportModule( "builtins" );

if( iomodule == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfunction = PyObject_GetAttrString( iomodule, "open" );

if( openfunction == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* openfile = PyObject_CallFunction( openfunction, 
       "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );

if( openfile == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* iterfunction = PyObject_GetAttrString( openfile, "__iter__" );
Py_DECREF( openfunction );

if( iterfunction == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfileresult = PyObject_CallObject( iterfunction, NULL );
Py_DECREF( iterfunction );

if( openfileresult == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* fileiterator = PyObject_GetAttrString( openfile, "__next__" );
Py_DECREF( openfileresult );
if( fileiterator == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* readline;
std::cout << "Here 1!" << std::endl;

while( ( readline = PyObject_CallObject( fileiterator, NULL ) ) != NULL ) {
    std::cout << "Here 2!" << std::endl;
    std::cout << PyUnicode_AsUTF8( readline ) << std::endl;
    Py_DECREF( readline );
}
PyErr_PrintEx(100);
PyErr_Clear();

PyObject* closefunction = PyObject_GetAttrString( openfile, "close" );

if( closefunction == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* closefileresult = PyObject_CallObject( closefunction, NULL );
Py_DECREF( closefunction );

if( closefileresult == NULL ) {
    PyErr_PrintEx(100); return;
}

Py_XDECREF( closefileresult );
Py_XDECREF( iomodule );
Py_XDECREF( openfile );
Py_XDECREF( fileiterator );

I pass ignore as the errors argument of open so that encoding errors are skipped, but Python still raises an encoding exception when it reaches an invalid UTF-8 byte:

Here 1!
Traceback (most recent call last):
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 26: invalid start byte

As shown above and repeated below, I pass the ignore parameter when calling builtins.open(), but it has no effect. I also tried changing ignore to replace, but CPython still raises the encoding exception:

PyObject* openfile = PyObject_CallFunction( openfunction, 
       "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );
Evandro Coan
  • I'm not sure it's your *only* problem, but it appears wrong that right after you set the initial value of `openfile` by calling `openfunction`, it is the value of the latter, not the former, that you test for being null. – John Bollinger Jun 06 '19 at 18:01
  • Thanks, I fixed it and retested the program. But the encoding problem persists. I also added a `"Here 1!"` and `"Here 2!"` and when running it, only `Here 1!` shows up before the stacktrace. – Evandro Coan Jun 06 '19 at 18:11

2 Answers

PyObject_CallFunction (and Py_BuildValue, and others) takes a single format string describing all of the arguments. When you do

PyObject* openfile = PyObject_CallFunction( openfunction, 
   "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );

you've said "one string argument", so every argument after filepath is ignored and open is effectively called as open(filepath), with the default strict error handling. Instead you should do:

PyObject* openfile = PyObject_CallFunction( openfunction, 
   "ssiss", filepath, "r", -1, "UTF8", "ignore" );

to say "5 arguments: two strings, an int, and two more strings". Even if you choose to use one of the other PyObject_Call* functions, you'll find it easier to build the arguments with Py_BuildValue this way too.
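To see the difference from the Python side, here is a rough sketch of the two calls the C code ends up making. It is a model under stated assumptions, not the exact C behavior: the file bytes are assumed (chosen so that a stray 0xbb sits at byte 26, as in the traceback), `read_lines` and `demo` are hypothetical helpers, and the "broken" call pins the encoding to UTF-8 so the result does not depend on the locale default that a bare open(filepath) would use.

```python
import os
import tempfile

# Assumed file contents: valid UTF-8 except for one stray 0xbb byte on line 2.
DATA = (
    b"parse this correctly\n"
    b"F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r\n"
    b"also parse this correctly\n"
)


def read_lines(path, **open_kwargs):
    """Iterate over open(path, **open_kwargs); return the lines or the decode error."""
    try:
        with open(path, **open_kwargs) as handle:
            return list(handle)
    except UnicodeDecodeError as error:
        return error


def demo():
    """Write DATA to a temporary file and read it back both ways."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(DATA)
    try:
        # Models the "s" format string: only the path is passed, so errors
        # keeps its default (strict). Encoding is pinned for reproducibility.
        broken = read_lines(tmp.name, encoding="UTF8")
        # Models the "ssiss" format string: all five arguments reach open().
        fixed = read_lines(
            tmp.name, mode="r", buffering=-1, encoding="UTF8", errors="ignore"
        )
    finally:
        os.unlink(tmp.name)
    return broken, fixed


if __name__ == "__main__":
    broken, fixed = demo()
    print("path only:", repr(broken))
    print("all five arguments:", fixed)
```

The first call fails on the invalid byte just like the traceback in the question, while the second returns all three lines with the stray byte dropped.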

Evandro Coan
DavidW

I managed to fix it by replacing PyObject_CallFunction with PyObject_CallFunctionObjArgs, which takes already-built Python objects followed by a terminating NULL:

PyObject* openfile = PyObject_CallFunction( openfunction, 
       "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );
// -->
PyObject* filepathpy = Py_BuildValue( "s", filepath );
PyObject* openmodepy = Py_BuildValue( "s", "r" );
PyObject* buffersizepy = Py_BuildValue( "i", -1 );
PyObject* encodingpy = Py_BuildValue( "s", "UTF-8" );
PyObject* ignorepy = Py_BuildValue( "s", "ignore" );

PyObject* openfile = PyObject_CallFunctionObjArgs( openfunction, 
        filepathpy, openmodepy, buffersizepy, encodingpy, ignorepy, NULL );

Long version, with the NULL checks the CPython API expects after each call:

PyObject* filepathpy = Py_BuildValue( "s", filepath );
if( filepathpy == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* openmodepy = Py_BuildValue( "s", "r" );
if( openmodepy == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* buffersizepy = Py_BuildValue( "i", -1 );
if( buffersizepy == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* encodingpy = Py_BuildValue( "s", "UTF-8" );
if( encodingpy == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* ignorepy = Py_BuildValue( "s", "ignore" );
if( ignorepy == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* openfile = PyObject_CallFunctionObjArgs( openfunction,
        filepathpy, openmodepy, buffersizepy, encodingpy, ignorepy, NULL );
Py_DECREF( filepathpy );
Py_DECREF( openmodepy );
Py_DECREF( buffersizepy );
Py_DECREF( encodingpy );
Py_DECREF( ignorepy );

if( openfile == NULL ) {
    PyErr_PrintEx(100); return;
}
Evandro Coan