
This is more of a conceptual question than a specific code question. We're looking at optimizing some of our lower-level "core" classes in our Python code. There's currently a class that essentially stores a dictionary of features, with a sorted list of things per feature. The caveat is that what a feature represents could be any data type.

Example:

blah = BasicClass()
print(blah.data)
>> {'a': [(1, 'foo'), (2, 'bar'), (10, 'baz')],
    'b': [(100, -133231), (236, -99594)],
    'c': [(27, [1,2]), (35, [1,2,3,4])]}

This class is mostly utility functions for looking up values in the data field given certain conditions.

Methods like:

get_first_value_after(feat, val) - Given one of the features and some number, find the first entry in data[feat] whose index is > val.

get_values_in_range(feat, start_val, end_val) - Given a feature and a range, find all entries in data[feat] whose index is between start_val and end_val.

So on and so forth.
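To make the two lookups concrete, here is a minimal pure-Python sketch using the standard bisect module. It is not the class's actual implementation: the constructor, the parallel `_keys` lists, returning None when nothing matches, and treating the range as inclusive on both ends are all assumptions made for illustration. It also assumes the "index" is the first item of each tuple and that each list is already sorted by it.

```python
from bisect import bisect_left, bisect_right


class BasicClass:
    """Hypothetical stand-in for the class described above."""

    def __init__(self, data):
        self.data = data
        # Assumption: keep a parallel list of indices per feature so that
        # bisect only ever compares numbers, never the arbitrary payloads.
        self._keys = {feat: [k for k, _ in entries]
                      for feat, entries in data.items()}

    def get_first_value_after(self, feat, val):
        # bisect_right returns the position of the first index strictly > val.
        i = bisect_right(self._keys[feat], val)
        entries = self.data[feat]
        return entries[i] if i < len(entries) else None

    def get_values_in_range(self, feat, start_val, end_val):
        # Assumption: the range is inclusive on both ends.
        keys = self._keys[feat]
        lo = bisect_left(keys, start_val)
        hi = bisect_right(keys, end_val)
        return self.data[feat][lo:hi]


blah = BasicClass({
    'a': [(1, 'foo'), (2, 'bar'), (10, 'baz')],
    'b': [(100, -133231), (236, -99594)],
    'c': [(27, [1, 2]), (35, [1, 2, 3, 4])],
})
```

Both lookups are O(log n) in the length of the per-feature list, which may be worth measuring before moving anything into C.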

I'm looking at optimizing this class to be as performant as possible, since it gets called in a lot of our stack. I've been looking into incorporating C, as that would theoretically offer some gains, but there seem to be many ways of extending into C, and I'm not sure which path to take.

Off the top of my head:

  1. ctypes functions - create the functions in C with no reference to the Python header file. Everything would use native C and be called through ctypes.CDLL, with the Python side converting arguments to ctypes types before passing them in.

  2. C functions module - create a helper functions module that uses Python's C API and returns PyObjects correctly. Then in my class above, I'd just make methods that pass through to the C module functions:

    def get_first_value_after(self, feat, val):
        return get_first_value_after_c(self.data[feat], val)

  3. Rewrite the whole class in C - This would allow our underlying data structure to be native C, and the functions would potentially be faster.
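For option 1, the conversion step on the Python side could look roughly like the sketch below. The helper name `to_c_keys` is hypothetical, and the shared library and C function mentioned in the trailing comments are made up and left commented out, so only the conversion itself runs. It assumes the indices fit in a C long.

```python
import ctypes


def to_c_keys(entries):
    """Copy the integer indices of one feature into a C array of long.

    Only the indices are converted; the payloads can be arbitrary Python
    objects, so they stay on the Python side.
    """
    keys = [k for k, _ in entries]
    return (ctypes.c_long * len(keys))(*keys)


entries = [(1, 'foo'), (2, 'bar'), (10, 'baz')]
c_keys = to_c_keys(entries)

# A native lookup would then take the array and return a position,
# e.g. (hypothetical library and function, not real code):
# lib = ctypes.CDLL("./liblookup.so")
# pos = lib.first_index_after(c_keys, len(c_keys), ctypes.c_long(2))
# result = entries[pos]
```

Note that the copy is O(n) per call, so unless the C array is cached alongside data[feat], the conversion could cost more than the lookup it speeds up.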

I have a bit of experience with 1 and 2, but I've never done 3 before. I was hoping someone would be able to give me insight into which would potentially yield the best results.

Thank you in advance.

EDIT: As mentioned below, we currently cythonize the whole file. I was looking to test if extending into C proper would yield better results.

Lzkatz
  • You seem to have missed out [`Cython`](http://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html) from your list of options – roganjosh Apr 11 '18 at 20:58
  • Fair point, we actually currently cythonize the file. I was wondering if extending in proper C would yield more performance. – Lzkatz Apr 11 '18 at 21:26
  • How are you implementing get_first_value_after and get_values_in_range? Are you using bisect as your lists appear to be sorted? – Dan D. Apr 11 '18 at 21:30
  • At the moment we aren't using bisect because they're lists of lists. Although bisect would probably be a quick improvement as well. – Lzkatz Apr 11 '18 at 21:45
