I'm trying to read an ASCII (text-based) file of a particular format. I've done some line profiling, and most of the time is spent in the loop. I'm trying to find out whether the code inside the loop can be made faster.
Things I've tried:

- Faster indexing of the numpy arrays by declaring them with the buffer interface, as in the official docs (roughly as in the sketch right after this list). I expected this to speed things up a lot, but it barely made a difference.
- A custom type-conversion function (no Python interaction) to replace int(line[0:5]), but it ended up being more costly.
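A minimal sketch of what I mean by the buffer-interface declaration (illustrative only; buffer_demo is not part of my actual code):

import numpy as np
cimport numpy as np

def buffer_demo(int n):
    # With the buffer declaration, arr[i] compiles to direct
    # memory access instead of a generic Python __getitem__ call.
    cdef np.ndarray[np.int32_t] arr = np.zeros(n, dtype=np.int32)
    cdef int i
    for i in range(n):
        arr[i] = i
    return arr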
The custom function for type conversion:
cdef int fast_atoi(str buf):
    cdef int i = 0, c = 0, x = 0
    for i in range(5):
        c = buf[i]               # code point of the i-th character
        if c > 47 and c < 58:    # '0' <= c <= '9'
            x = x * 10 + c - 48  # accumulate decimal digits
    return x
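One variant I haven't timed yet: opening the file in binary mode and parsing bytes instead of str, so each character access is a plain C char lookup rather than a unicode access (a sketch, assuming the file is pure ASCII; fast_atoi_bytes is a hypothetical name):

cdef int fast_atoi_bytes(bytes buf):
    cdef int i = 0, x = 0
    cdef char c
    for i in range(5):
        c = buf[i]               # C-level char access on a typed bytes object
        if c > 47 and c < 58:    # '0' <= c <= '9'
            x = x * 10 + c - 48  # accumulate decimal digits
    return x

Using this would also mean open(filename, 'rb') and slicing bytes objects in the main loop.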
The main code block that I want to optimize:
import numpy as np
cimport numpy as np

def func(filename):
    cdef np.ndarray[np.int32_t] a1
    cdef np.ndarray[object] a2
    cdef np.ndarray[object] a3
    cdef np.ndarray[np.int32_t] a4
    cdef int n_lines
    cdef int i
    cdef str line
    with open(filename) as inf:
        next(inf)                   # skip the title line
        n_lines = int(next(inf))    # second line holds the number of data lines
        a1 = np.zeros(n_lines, dtype=np.int32)
        a2 = np.zeros(n_lines, dtype=object)
        a3 = np.zeros(n_lines, dtype=object)
        a4 = np.zeros(n_lines, dtype=np.int32)
        for i, line in enumerate(inf):
            if i == n_lines:
                break
            try:
                # fixed-width fields, 5 characters each
                a1[i] = int(line[0:5])      # or the custom fast_atoi(line[0:5])
                a2[i] = line[5:10].strip()
                a3[i] = line[10:15].strip()
                a4[i] = int(line[15:20])
            except (ValueError, TypeError):
                break
    return a1, a2, a3, a4
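For completeness, I build this the standard way for numpy-backed Cython (a sketch; parser_mod is a placeholder for my actual module name):

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize(
        # numpy's headers are needed because of the cimport above
        Extension("parser_mod", ["parser_mod.pyx"],
                  include_dirs=[numpy.get_include()])
    )
)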
The file I'm reading is 4.3 MB. Its first line is a title, the second is the number of data lines (shown here as the placeholder n_lines), and every data line consists of fixed-width, 5-character fields followed by three floats. The alignment below is representative, since the slices in the code assume fixed columns:

Author
n_lines
    1xyz      A    1   5.202   4.356   3.155
    1mno     A1    2   5.119   4.411   3.172
    1mno     A2    3   5.155   4.283   3.104
    1nnn     B3    4   5.247   4.318   3.237
    1xax     KA    5   5.306   4.421   3.075
    1ooo     MA    6   5.383   4.347   3.054
    1cbd     NB    7   5.257   4.474   2.941
    1orc    OB1    8   5.189   4.404   2.893
The current implementation takes 76 ms on average on my machine; swapping in the custom function mentioned above makes it worse.
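A timing harness along these lines reproduces that measurement (parser_mod and data.txt are placeholders for my actual names):

import timeit

# average over 50 runs, in seconds per call
t = timeit.timeit("func('data.txt')",
                  setup="from parser_mod import func",
                  number=50)
print(t / 50)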
I'd be very grateful for any suggested improvements. I'm new to Cython.