
I'm converting a pure-Python module to a C-extension to familiarize myself with the C API.

The Python implementation is as follows:

_CRC_TABLE_ = [0] * 256

def initialize_crc_table():
    if _CRC_TABLE_[1] != 0:  # Safeguard against re-initialization
        return
    # snip

def calculate_crc(data: bytes, initial: int = 0) -> int:
    if _CRC_TABLE_[1] == 0:  # In case user forgets to initialize first
        initialize_crc_table()
    # snip

# additional non-CRC methods trimmed

My C-extension thus far works:

#include <Python.h>

static Py_ssize_t CRC_TABLE_LEN = 256;
PyObject *_CRC_TABLE_;

static PyObject *method_initialize_crc_table(PyObject *self, PyObject *args) {
   // snip
}

static PyMethodDef module_methods[] = {
  {"initialize_crc_table", method_initialize_crc_table, METH_VARARGS, NULL},
  {NULL, NULL, 0, NULL}
};

void _allocate_table_(void) {
  _CRC_TABLE_ = PyList_New(CRC_TABLE_LEN);
  if (_CRC_TABLE_ == NULL) return;
  for (Py_ssize_t i = 0; i < CRC_TABLE_LEN; i++) {
    // PyList_SetItem steals a reference, so every slot needs its own object.
    PyList_SetItem(_CRC_TABLE_, i, Py_BuildValue("i", 0));
  }
}

#if PY_MAJOR_VERSION >= 3
static struct PyModuleDef module_utilities = {
  PyModuleDef_HEAD_INIT,
  "utilities",
  NULL,
  -1,
  module_methods,
};

PyMODINIT_FUNC PyInit_utilities() {
  PyObject *module = PyModule_Create(&module_utilities);
  _allocate_table_();
  PyModule_AddObject(module, "_CRC_TABLE_", _CRC_TABLE_);
  return module;
}
#else
PyMODINIT_FUNC initutilities() {
  PyObject *module = Py_InitModule3("utilities", module_methods, NULL);
  _allocate_table_();
  PyModule_AddObject(module, "_CRC_TABLE_", _CRC_TABLE_);
}
#endif

I can access utilities._CRC_TABLE_ from the C-extension in the interpreter, and its values match the Python equivalent after invoking utilities.initialize_crc_table.

Now I'm trying to call initialize_crc_table at the start of calculate_crc, performing the same check as used in the Python implementation. I'm returning None for now:

static PyObject *method_calculate_crc(PyObject *self, PyObject *args) {
  if (!(uint)PyLong_AsUnsignedLong(PyList_GetItem(_CRC_TABLE_, (Py_ssize_t) 1))) {
    PyObject *call_initialize_crc_table = PyObject_GetAttrString(self, "initialize_crc_table");
    PyObject_CallObject(call_initialize_crc_table, NULL);
    Py_DECREF(call_initialize_crc_table);
  }
  Py_RETURN_NONE;
}

I've added this to module_methods[] and it compiles without warnings or errors. When I run this method within the interpreter, I get a segfault. I assume it's because self isn't the module as an object.

I can do this as an alternative, which appears to work without issue:

static PyObject *method_calculate_crc(PyObject *self, PyObject *args) {
  if (!(uint)PyLong_AsUnsignedLong(PyList_GetItem(_CRC_TABLE_, (Py_ssize_t) 1))) {
    method_initialize_crc_table(self, NULL);
  }
  Py_RETURN_NONE;
}

However, I am not certain if I should be passing self, NULL, or something else to the method.

What is the proper way of invoking method_initialize_crc_table from method_calculate_crc?

Kamikaze Rusher
  • `self` is normally unused for module-level functions. – DavidW Apr 08 '20 at 21:14
  • @DavidW I haven't checked it yet, but I assume that means `self` is equal to `NULL`? If so, then my call to `PyObject_GetAttrString` probably contributed to the segfault. In that case, is it safe/proper to call a method directly within a module? – Kamikaze Rusher Apr 08 '20 at 21:29
  • Yes. If you really want to use `PyObject_GetAttrString` then you probably _do_ have a copy of the module object as a global. But I'd just call the functions directly (like in your second case) – DavidW Apr 09 '20 at 06:12
  • So looking at the [documentation, `self` should be the module object](https://docs.python.org/3/c-api/structures.html#METH_VARARGS). In my experience it's pretty rare to use it, so your second method is still "right". If you want to use the first method then you should check all the return values against `NULL` as you go to spot errors as they happen. – DavidW Apr 09 '20 at 06:59
  • @DavidW I checked `self` and it is indeed `NULL` even though documentation states that using `METH_VARARGS` requires a `self` and `args` parameter. So I'm under the opinion that I must call the method directly. That's fine, but it would also imply that any module globals that have been added via `PyModule_AddObject` would not be accessible. I do have a pointer to the `_CRC_TABLE_`, but examples that I have seen don't keep global pointers. Anyways, it's not quite clear cut. – Kamikaze Rusher Apr 09 '20 at 13:37
  • You could always store your module pointer as a global variable in the `PyInit_...` function if you need access to it. – DavidW Apr 09 '20 at 19:08
  • @DavidW do I need to worry about de-referencing the pointer, or will Python handle that when exiting? Just wanting to avoid any potential memory leaks. – Kamikaze Rusher Apr 09 '20 at 21:38
  • Python will definitely handle it when exiting (memory leaks never persist after the program has exited). Extension modules don't typically get unloaded before exit. – DavidW Apr 10 '20 at 06:35

1 Answer


There was a "gotcha" here that I must clarify. While the code was intended for Python 3, development was initially done in Python 2 because the Python 3 development files were not yet available on the machine I was using. This exposed some differences in how each version handles things. David's comments helped lead to this clarification.

If a function is declared with METH_VARARGS but belongs to a module (rather than a class), Python 2 passes NULL for the PyObject *self parameter. This is noted in the documentation but is easy to overlook. Python 3, however, does pass a pointer to the module. As DavidW recommended, I implemented a global variable to hold a reference to the module. Assuming his claim that Python handles the cleanup at exit is correct, we can safely use this for accessing module global attributes.
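One way to confirm the Python 3 behaviour from the interpreter side is to inspect the `__self__` attribute of a built-in function, which holds the object passed as `self` to the underlying C function. The following is a minimal sketch, assuming CPython 3. It uses the standard `math` module as a stand-in for `utilities`, and the helper name `bound_module` is made up for illustration:

```python
import math
import types


def bound_module(func):
    """Return the module a built-in function receives as `self`, or None."""
    # C-implemented functions expose their `self` argument as __self__.
    owner = getattr(func, "__self__", None)
    return owner if isinstance(owner, types.ModuleType) else None


# On CPython 3 a module-level C function is bound to its module.
print(bound_module(math.sqrt) is math)
```

The same check applied to `utilities.calculate_crc` should report the `utilities` module under Python 3.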

With our issue of PyObject *self solved, we no longer get a segfault. We can then address the question of which approach is (seemingly more) correct for calling a method within the local scope of the module. Do we do this:

if (/* conditional */)
    PyObject_CallMethod(module, "initialize_crc_table", NULL);

Or this:

if (/* conditional */)
    method_initialize_crc_table(self, args);

Benchmarks seem to provide an answer here. Using Python's built-in timeit module, we can see a very clear difference in performance. Note that so far in our implementation, .calculate_crc accesses ._CRC_TABLE_ and checks whether it is initialized, but no processing occurs. Results under Python 2 and Python 3 were identical, so the comparison between versions is omitted.

The command is as follows:

python3 -m timeit "import utilities; utilities.calculate_crc(0)"

PyObject_CallMethod: 874 nsec per loop
method_initialize_crc_table: 44.3 usec per loop

Using the PyObject_ function is reported as roughly 50x faster (874 nsec versus 44.3 usec per loop), quite a significant difference. Benchmarks alone do not establish which approach is "more correct", but with no clear guidance they may be sufficient justification for our use. Therefore, I will be using PyObject_ calls for this project.
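In the timeit command used here, `import utilities` is part of the timed statement, so it is re-executed (as a cached module lookup) on every loop. The following sketch shows one way to time only the call by moving the import into the setup. The helper name `time_call` is hypothetical, and `binascii.crc32` stands in for `utilities.calculate_crc`, which is assumed not to be installed:

```python
import timeit


def time_call(stmt, setup="pass", repeat=5, number=100_000):
    """Return the best per-call time in seconds for `stmt`."""
    timer = timeit.Timer(stmt=stmt, setup=setup)
    # Take the minimum of several runs to reduce scheduling noise.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number


# The setup string runs once per repeat and is excluded from the timing.
per_call = time_call("binascii.crc32(b'')", setup="import binascii")
print(f"{per_call * 1e9:.1f} nsec per call")
```

The command-line equivalent would be to pass the import through timeit's `-s` option, so that only the function call is measured.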

Kamikaze Rusher