from libcpp.algorithm cimport sort as stdsort
from libcpp.algorithm cimport unique
from libcpp.vector cimport vector
# from libcpp cimport bool
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
cdef class Vector:
    cdef vector[cython.int] wrapped_vector

    # the easiest thing to do is add short wrappers for the methods you need
    def push_back(self, int num):
        self.wrapped_vector.push_back(num)

    def sort(self):
        stdsort(self.wrapped_vector.begin(), self.wrapped_vector.end())

    def unique(self):
        self.wrapped_vector.erase(unique(self.wrapped_vector.begin(), self.wrapped_vector.end()), self.wrapped_vector.end())


    def __str__(self):
        return "[" + ", ".join([str(i) for i in self.wrapped_vector]) + "]"

    def __repr__(self):
        return str(self)

    def __len__(self):
        return self.wrapped_vector.size()

    @cython.boundscheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    def __setitem__(self, int key, int item):
        self.wrapped_vector[key] = item

    @cython.boundscheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    def __getitem__(self, int key):
        return self.wrapped_vector[key]

I have tried to wrap vectors so that I can use them in Python dicts.

This seems to create a crazy amount of overhead. See lines 72 and 75, for example: they just add an integer to the number already in the vector:

[Image: Cython annotation output for the code in question]

Is it possible to remove this overhead or is this the price I pay to wrap vectors?

The Unfun Cat
  • You are creating a Python object, a dictionary. But by the way, it looks like this just produces `tags` for the last file, throwing away the results for the previous reads. – hpaulj Oct 08 '18 at 16:30
  • Yes, the logic of the code is not important yet, but good catch :) – The Unfun Cat Oct 08 '18 at 16:31
  • Where am I creating a Python dict though? – The Unfun Cat Oct 08 '18 at 16:32
  • What does `add_reads_to_dict` do? I see several uses of `tags.values()` and `tags.items()`, Python dictionary methods. – hpaulj Oct 08 '18 at 16:42
  • Sorry! Important info. add_reads_to_dict creates a dict of vectors. So the `v`s above are vectors :) – The Unfun Cat Oct 08 '18 at 16:44
  • And the lines involving just `v` are a paler yellow, indicating less overhead. I was focusing on the bright yellow lines. Did you expand those lines, 72, 75, 78 to see what they are doing? – hpaulj Oct 08 '18 at 16:53
  • As long as you use def-functions you will get the overhead of creating python-objects. You either have to use the cdef-Versions of functions or vectorize your operations - similar to numpy. – ead Oct 08 '18 at 17:23

1 Answer

This seems to be based on my answer to another question. The purpose of adding __getitem__ and __setitem__ to the cdef class Vector is purely so that it can be indexed from Python. From Cython you can index into the C++ vector directly for extra speed.

At the start of your files_to_bins add the line:

cdef Vector v

This will get Cython to make sure that anything assigned to v is a Vector object (it'll raise a TypeError if not) and thus you'll be allowed to access its cdef attributes directly.

Then change the line:

v[i] = v[i] + half_fragment_size

to:

v.wrapped_vector[i] = v.wrapped_vector[i] + half_fragment_size

(and similarly for the other indexing lines)


Be aware that boundscheck(False) and wraparound(False) do absolutely nothing for C++ objects. The C++ indexing operator performs no bounds checking (and Cython doesn't add any), and it does not support negative indexing either. boundscheck and wraparound only apply to indexing memoryviews or numpy arrays.
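
To illustrate the point about the C++ indexing operator, here is a minimal C++ sketch of the two access styles on a `std::vector<int>`. The helper names `checked_get` and `add_offset` are hypothetical and exist only for this example; they are not part of the code in the question. `operator[]` does no checking at all, so direct indexing of `wrapped_vector` boils down to something like the loop in `add_offset`, while `at()` is the checked alternative if you do want out-of-range indices detected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Checked read: std::vector::at() throws std::out_of_range for a bad index.
// Returns true and stores the value in `out` if the index was valid.
// (Hypothetical helper, for illustration only.)
bool checked_get(const std::vector<int>& v, std::size_t i, int& out) {
    try {
        out = v.at(i);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// Unchecked in-place update: operator[] performs no bounds checking,
// so the loop condition is the only thing keeping i inside the vector.
// (Hypothetical helper, for illustration only.)
void add_offset(std::vector<int>& v, int offset) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = v[i] + offset;
    }
}
```

Note that a negative index converted to `std::size_t` becomes a very large positive value: `at()` rejects it with an exception, whereas `operator[]` would simply read out of bounds (undefined behaviour). There is no Python-style wraparound in either case.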

DavidW