I'm relatively new to Python and this is my first attempt at writing a C extension.
Background In my Python 3.X project I need to load and parse large binary files (10-100MB) to extract data for further processing. The binary content is organized in frames: headers followed by a variable amount of data. Due to the low performance in Python I decided to go for a C extension to speedup the loading part.
The standalone C code outperforms Python by a factor in between 20x-500x so I am pretty satisfied with it.
The problem: the memory keeps growing when I invoke the function from my C-extension multiple times within the same Python module.
my_c_ext.c
#include <Python.h>
#include <numpy/arrayobject.h>
#include "my_c_ext.h"
static unsigned short *X, *Y;
static PyObject* c_load(PyObject* self, PyObject* args)
{
char *filename;
if(!PyArg_ParseTuple(args, "s", &filename))
return NULL;
PyObject *PyX, *PyY;
__load(filename);
npy_intp dims[1] = {n_events};
PyX = PyArray_SimpleNewFromData(1, dims, NPY_UINT16, X);
PyArray_ENABLEFLAGS((PyArrayObject*)PyX, NPY_ARRAY_OWNDATA);
PyY = PyArray_SimpleNewFromData(1, dims, NPY_UINT16, Y);
PyArray_ENABLEFLAGS((PyArrayObject*)PyY, NPY_ARRAY_OWNDATA);
PyObject *xy = Py_BuildValue("NN", PyX, PyY);
return xy;
}
...
//More Python C-extension boilerplate (methods, etc..)
...
void __load(char *) {
// open file, extract frame header and compute new_size
X = realloc(X, new_size * sizeof(*X));
Y = realloc(Y, new_size * sizeof(*Y));
X[i] = ...
Y[i] = ...
return;
}
test.py
import my_c_ext as ce
binary_files = ['file1.bin',...,'fileN.bin']
for f in binary_files:
x,y = ce.c_load(f)
del x,y
Here I am deleting the returned objects in hope of lowering memory usage.
After reading several posts (e.g. this, this and this), I am still stuck.
I tried to add/remove the PyArray_ENABLEFLAGS
setting the NPY_ARRAY_OWNDATA
flag without experiencing any difference. It is not yet clear to me if the NPY_ARRAY_OWNDATA
implies a free(X)
in C. If I explicitly free the arrays in C, I ran into a segfault
when trying to load second file in the for loop in test.py
.
Any idea of what am I doing wrong?