I want to compile a Python function with Cython that reads a binary file while skipping some records (without reading the whole file and then slicing, since I would run out of memory). I came up with something like this:

    import numpy

    dtype = numpy.float64  # element type of the records (assumed here to be double)

    def FromFileSkip(fid, count=1, skip=0):
        if skip>=0:
            data = numpy.zeros(count, dtype=dtype)
            k = 0
            while k<count:
                try:
                    data[k] = numpy.fromfile(fid, count=1, dtype=dtype)
                    fid.seek(skip, 1)
                    k +=1
                except ValueError:
                    data = data[:k]
                    break
            return data

and then I can use the function like this:

    f = open(filename, 'rb')
    data = FromFileSkip(f,...

However, to compile `FromFileSkip` with Cython, I would like to declare the types of everything involved in the function, including `fid`, the file handle. How can I declare its type in Cython, given that it is not a "standard" type such as an integer? Thanks.

user2061949
  • Why is it important to type that variable? Since it's a Python object you won't obtain any speedup. – Bakuriu Mar 12 '13 at 08:44
  • If you want to assign it to a class variable, you use the `object` type. – Henry Gomersall Mar 12 '13 at 09:10
  • So typing the file handle would not change much? I thought that typing all variables, without exceptions, improved performance compared to typing just some of them. – user2061949 Mar 12 '13 at 09:53
  • You should avoid calling `numpy.fromfile` inside the loop, because it is a Python function and will most probably be the bottleneck regardless of what else you optimize. Consider using low-level C stdio functions for speed. Some examples are here: https://groups.google.com/forum/?fromgroups=#!topic/cython-users/Px1nLMe7dZY – dmytro Mar 12 '13 at 13:54
  • The file handle is of type `file`, which is a builtin. Whether that will help cython any I have no clue. Since it's a type implemented in C, it might be able to avoid going through the interpreter to call its methods. That said the documentation seems to imply cython doesn't really do anything special for any non-primitive types. – millimoose Mar 12 '13 at 22:42

1 Answer

Declaring the type of `fid` won't help, because calling Python functions is still costly. Try compiling your example with the `-a` flag to see what I mean. However, you can use low-level C functions for file handling to avoid the Python overhead in your loop. For the sake of example, I assumed that the data starts right at the beginning of the file and that its type is double:

    from libc.stdio cimport *

    cdef extern from "stdio.h":
        FILE *fdopen(int, const char *)

    import numpy as np
    cimport numpy as np

    DTYPE = np.double # or whatever your type is
    ctypedef np.double_t DTYPE_t # or whatever your type is

    def FromFileSkip(fid, int count=1, int skip=0):
        cdef int k
        cdef int n = 0 # number of items actually read
        cdef FILE* cfile
        cdef np.ndarray[DTYPE_t, ndim=1] data
        cdef DTYPE_t* data_ptr

        cfile = fdopen(fid.fileno(), b'rb') # attach the stream
        data = np.zeros(count, dtype=DTYPE)
        data_ptr = <DTYPE_t*>data.data

        # maybe skip some header bytes here
        # ...

        for k in range(count):
            # fread returns the number of items read, so anything but 1 means EOF or an error
            if fread(<void*>(data_ptr + k), sizeof(DTYPE_t), 1, cfile) != 1:
                break
            n += 1
            if fseek(cfile, skip, SEEK_CUR):
                break

        return data[:n] # drop the unfilled tail if the file ended early

Note that the output of `cython -a example.pyx` shows no Python overhead inside the loop.
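To check what the compiled function returns, it can help to compare it against a slow pure-Python version of the same read-one-then-skip pattern. The following is only a sketch: `read_skip_reference` is a hypothetical helper name, and it assumes the records are native-endian 8-byte doubles, as in the example above.

```python
import io
import struct

def read_skip_reference(fid, count=1, skip=0):
    """Read up to `count` doubles, skipping `skip` bytes after each one."""
    item = struct.Struct("d")  # assumption: native-endian 8-byte doubles
    values = []
    for _ in range(count):
        chunk = fid.read(item.size)
        if len(chunk) < item.size:  # end of file: keep what was read so far
            break
        values.append(item.unpack(chunk)[0])
        fid.seek(skip, 1)  # 1 means "relative to the current position"
    return values

# Ten doubles 0.0 .. 9.0 in memory; skipping 8 bytes keeps every second value.
buf = io.BytesIO(struct.pack("10d", *range(10)))
print(read_skip_reference(buf, count=5, skip=8))
```

Running both functions on the same small file and comparing the results is a quick way to confirm that the `fread`/`fseek` loop reads the records you expect.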

dmytro