
I'm trying to read an ASCII (text-based) file of a particular format. I've done some line profiling, and most of the time is spent in the loop. I'd like to know whether the code inside the loop can be made faster.

Things I've tried

  1. Faster indexing of the numpy arrays by declaring them with the buffer interface, as in the official docs (a small sketch of what I mean follows this list). I expected this to speed things up a lot, but it barely made a difference.

  2. A custom type-conversion function (no Python interaction) to replace int(line[0:5]), but it ended up being rather costly.
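
To make point 1 concrete, this is roughly the kind of declaration I mean; the typed-memoryview form is only shown for comparison and isn't what my reader uses:

cimport numpy as np
import numpy as np

def index_demo(int n):
    # Buffer-interface declaration, as in the Cython+numpy docs (what I tried):
    cdef np.ndarray[np.int32_t] a = np.zeros(n, dtype=np.int32)
    # Typed memoryview over the same array (alternative form, for comparison):
    cdef np.int32_t[:] v = a
    cdef int i
    for i in range(n):
        a[i] = i     # indexed write through the buffer syntax
        v[i] = i     # indexed write through the memoryview
    return a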

The custom function for type conversion:

cdef int fast_atoi(str buf):
    # Fold the ASCII digits of a five-character field into an int,
    # skipping anything that is not a digit (e.g. leading spaces).
    cdef int i = 0, c = 0, x = 0
    for i in range(5):
        c = buf[i]
        if c > 47 and c < 58:
            x = x * 10 + c - 48
    return x
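
One variant I haven't benchmarked yet: if the file were opened in binary mode so each line is a bytes object, the conversion could work on the raw characters through a char* instead of indexing a Python str character by character. Only a sketch of the idea; it would also mean the slicing and .strip() calls in the main loop have to work on bytes:

cdef int fast_atoi_bytes(bytes buf):
    # Same digit-folding logic as fast_atoi above, but on a bytes object:
    # the char* indexing below stays at the C level.
    cdef const char* s = buf
    cdef Py_ssize_t n = len(buf)
    cdef Py_ssize_t i
    cdef int c, x = 0
    for i in range(n):
        c = s[i]
        if 47 < c < 58:          # '0'..'9'
            x = x * 10 + c - 48
    return x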

The main code block that I want to optimize:

cimport numpy as np
import numpy as np

def func(filename):
    cdef np.ndarray[np.int32_t] a1
    cdef np.ndarray[object] a2
    cdef np.ndarray[object] a3
    cdef np.ndarray[np.int32_t] a4
    cdef int count = 0
    cdef int n_lines
    cdef str line
    with open(filename) as inf:
        next(inf)                      # skip the title line
        n_lines = int(next(inf))       # second line holds the number of data lines
        a1 = np.zeros(n_lines, dtype=np.int32)
        a2 = np.zeros(n_lines, dtype=object)
        a3 = np.zeros(n_lines, dtype=object)
        a4 = np.zeros(n_lines, dtype=np.int32)
        for i, line in enumerate(inf):
            if i == n_lines:
                break
            try:
                a1[i] = int(line[0:5])    # custom function: fast_atoi(line[0:5])
                a2[i] = line[5:10].strip()
                a3[i] = line[10:15].strip()
                a4[i] = int(line[15:20])
            except (ValueError, TypeError):
                break

The file I'm reading is 4.3 MB and looks like this (the first line is skipped, the second holds the number of data lines):

Author
n_lines
    1xyz      A    1   5.202   4.356   3.155
    1mno     A1    2   5.119   4.411   3.172
    1mno     A2    3   5.155   4.283   3.104
    1nnn     B3    4   5.247   4.318   3.237
    1xax     KA    5   5.306   4.421   3.075
    1ooo     MA    6   5.383   4.347   3.054
    1cbd     NB    7   5.257   4.474   2.941
    1orc     OB1   8   5.189   4.404   2.893

The current implementation takes 76 ms on average on my machine; adding the custom function mentioned above makes it worse.

I'd be very grateful for any suggested improvements. I'm new to Cython.

Fenil
  • Have you tried the `pandas` csv reader? – hpaulj Apr 02 '19 at 05:31
  • Yes, it's just more costly. There is string processing involved while the file is being read, which isn't as direct to do when reading with pandas. – Fenil Apr 02 '19 at 09:33
  • You probably want to turn off bounds checking for your function. It may also be necessary to convert your str to a character array; I don't know offhand whether Cython automatically optimizes Python string access. You can get an annotated report from the Cython compiler with `cython -a`; any lines highlighted in yellow in the report have some Python overhead which you will want to get rid of (a rough sketch of these directives follows the comments). – ngoldbaum Apr 02 '19 at 18:14
  • Thanks for the suggestions @ngoldbaum. – Fenil Apr 03 '19 at 14:25
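
For reference, the directives ngoldbaum mentions would look roughly like this; the cut-down fill() function is made up purely for illustration, and I haven't benchmarked it:

cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)   # skip per-access bounds checks on the typed arrays
@cython.wraparound(False)    # assume no negative indices
def fill(np.ndarray[np.int32_t] a1, lines):
    # Cut-down stand-in for the loop body of func(), just to show where
    # the directives are applied.
    cdef int i
    cdef str line
    for i, line in enumerate(lines):
        a1[i] = int(line[0:5])
    return a1

Running `cython -a` on the module then produces an annotated HTML report; anything still highlighted in yellow inside the loop is still going through the Python C-API.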

0 Answers