How to create a custom numpy dtype using cython

Question

There are examples for creating custom numpy dtypes using C here:

Additionally, it seems to be possible to create custom ufuncs in cython:

It seems like it should also be possible to create a dtype using cython (and then create custom ufuncs for it). Is it possible? If so, can you post an example?

USE CASE:

I want to do some survival analysis. The basic data elements are survival times (floats) with associated censor values (False if the associated time represents a failure time and True if it instead represents a censoring time (i.e., no failure occurred during the period of observation)).

Obviously I could just use two numpy arrays to store these values: a float array for the times and a bool array for the censor values. However, I want to account for the possibility of an event occurring multiple times (this is a good model for, say, heart attacks - you can have more than one). In this case, I need an array of objects which I call MultiEvents. Each MultiEvent contains a sequence of floats (uncensored failure times) and an observation period (also a float). Note that the number of failures is not the same for all MultiEvents.

I need to be able to perform a few operations on an array of MultiEvents:

1. Get the number of failures for each.
2. Get the censored time (that is, the period of observation minus the sum of all failure times).
3. Calculate a log likelihood based on additional arrays of parameters (such as an array of hazard values). For example, the log likelihood for a single MultiEvent M and constant hazard value h would be something like:

sum(log(h) + h*t for t in M.times) - h*(M.period - sum(M.times))

where M.times is the list (array, whatever) of failure times and M.period is the total observation period. I want the proper numpy broadcasting rules to apply, so that I can do:

log_lik = logp(M_vec,h_vec)

and it will work as long as the dimensions of M_vec and h_vec are compatible.

My current implementation uses numpy.vectorize. That works well enough for 1 and 2, but it is too slow for 3. Note also that I can't do this because the number of failures in my MultiData objects is not known ahead of time.
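For concreteness, a minimal sketch of what the numpy.vectorize approach to the first two operations might look like. The MultiEvent class, its attribute names, and the sample values are assumptions made for illustration, not the actual implementation:

```python
import numpy as np


class MultiEvent:
    """Hypothetical container: uncensored failure times plus one observation period."""

    def __init__(self, times, period):
        self.times = np.asarray(times, dtype=float)
        self.period = float(period)


# Build an object array element by element so numpy stores each MultiEvent as-is.
events = [MultiEvent([0.4, 0.6], 4.0), MultiEvent([2.6], 6.0), MultiEvent([], 5.0)]
M_vec = np.empty(len(events), dtype=object)
for i, event in enumerate(events):
    M_vec[i] = event

# Operation 1: number of failures per MultiEvent.
n_failures = np.vectorize(lambda m: m.times.size, otypes=[int])

# Operation 2: censored time, i.e. period minus the sum of all failure times.
censored_time = np.vectorize(lambda m: m.period - m.times.sum(), otypes=[float])

counts = n_failures(M_vec)
censored = censored_time(M_vec)
```

Each call still runs a Python-level function once per element, which is why this pattern stays slow for heavier computations such as the log likelihood.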

Is your reason for asking because you find writing cython simpler than writing C? I suspect that if it is possible (which I don't know), you will end up with code that is just as complex and messy as C, so there may not be any benefit. — DaveP, Nov 05 '12 at 08:14
@DaveP There are two reasons. One is that I find it simpler to write in cython than C. The other is that I would like to make this process easy for python programmers to repeat for new dtypes and ufuncs. I am hoping that I can wrap most of the complexity and make defining dtypes a simple thing to do in cython. That said, cython is something I only learned about last week. I've been playing with it, but at this point I do not fully understand its capabilities. — jcrudy, Nov 05 '12 at 19:19
have you considered using [pandas](http://pandas.pydata.org/) — btel, Nov 11 '12 at 14:30

btel · Answer 1 · 2012-11-12T12:45:40.750

Numpy arrays are most suitable for data types with fixed size. If the objects in the array are not fixed size (such as your MultiEvent) the operations can become much slower.

I would recommend storing all of the survival times in a 1-D record array with three fields: event_id, time, period. Each event can appear multiple times in the array:

>>> import numpy as np
>>> rawdata = [(1, 0.4, 4), (1, 0.6, 6), (2, 2.6, 6)]
>>> npdata = np.rec.fromrecords(rawdata, names='event_id,time,period')
>>> print(npdata)
[(1, 0.4, 4) (1, 0.6, 6) (2, 2.6, 6)]

To get the data for a specific event you can index with a boolean mask:

>>> eventdata = npdata[npdata.event_id == 1]
>>> print(eventdata)
[(1, 0.4, 4) (1, 0.6, 6)]

The advantage of this approach is that you can easily integrate it with your ndarray-based functions. You can also access these arrays from cython as described in the manual:

import numpy as np
cimport numpy as np

# Packed struct, so its layout matches the (unaligned) numpy dtype below
cdef packed struct Event:
    np.int32_t event_id
    np.float64_t time
    np.float64_t period

def f():
    cdef np.ndarray[Event] b = np.zeros(10,
        dtype=np.dtype([('event_id', np.int32),
                        ('time', np.float64),
                        ('period', np.float64)]))
    <...>
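With this flat layout, the per-event quantities from the question (number of failures and censored time) can be computed without a Python loop. The following is only a sketch under assumptions: the sample data uses one consistent period per event_id, and the variable names are made up for illustration.

```python
import numpy as np

# One row per failure; the period is repeated on every row of the same event.
rawdata = [(1, 0.4, 6.0), (1, 0.6, 6.0), (2, 2.6, 6.0)]
npdata = np.rec.fromrecords(rawdata, names='event_id,time,period')

# Map each row to a dense group index 0..n_events-1.
ids, inverse = np.unique(npdata.event_id, return_inverse=True)

# Number of failures per event: count the rows in each group.
n_failures = np.bincount(inverse)

# Sum of failure times per event: weighted count with the times as weights.
total_time = np.bincount(inverse, weights=npdata.time)

# Observation period per event (any row of the group carries it).
period = np.zeros(len(ids))
period[inverse] = npdata.period

# Censored time: observation period minus the sum of failure times.
censored = period - total_time
```

One limitation of this layout is that an event with no failures at all has no row, so it would need to be tracked separately (for example in a second array of periods).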

score 0 · Answer 2 · answered Nov 09 '12 at 06:26

I apologise for not answering the question directly, but I've had similar problems before. If I understand correctly, the real problem is that you have variable-length data. That is really not one of the strengths of numpy, and it is the reason you're running into performance issues. Unless you know in advance the maximum number of entries for a MultiEvent, you'll have problems, and even then you'll waste a lot of memory and disk space on zero padding for the events that have fewer entries.

You have data points with more than one field, some of which are related to other fields, and some of which need to be identified in groups. This hints strongly that you should consider a database of some form for storing this information, for performance, memory, space-on-disk and sanity reasons.

It will be much easier for a person new to your code to understand a simple database schema than a complicated, hacked-on-numpy structure that will be frustratingly slow and bloated. SQL queries are quick and easy to write in comparison.

Based on my understanding of your explanation, I would suggest having Event and MultiEvent tables, where each Event entry has a foreign key into the MultiEvent table where relevant.
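As a rough sketch of that schema, here is one way it could look using Python's built-in sqlite3 module with an in-memory database. The table names, column names, and sample rows are assumptions chosen for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per MultiEvent, one row per failure (Event).
conn.executescript("""
CREATE TABLE multi_event (
    id     INTEGER PRIMARY KEY,
    period REAL NOT NULL
);
CREATE TABLE event (
    id             INTEGER PRIMARY KEY,
    multi_event_id INTEGER NOT NULL REFERENCES multi_event(id),
    time           REAL NOT NULL
);
""")

conn.executemany(
    "INSERT INTO multi_event (id, period) VALUES (?, ?)",
    [(1, 4.0), (2, 6.0), (3, 5.0)],
)
conn.executemany(
    "INSERT INTO event (multi_event_id, time) VALUES (?, ?)",
    [(1, 0.4), (1, 0.6), (2, 2.6)],
)

# Failure count and censored time per MultiEvent. The LEFT JOIN keeps
# MultiEvents that have no failures at all (count 0, censored time = period).
rows = conn.execute("""
SELECT m.id,
       COUNT(e.id) AS n_failures,
       m.period - COALESCE(SUM(e.time), 0.0) AS censored_time
FROM multi_event AS m
LEFT JOIN event AS e ON e.multi_event_id = m.id
GROUP BY m.id
ORDER BY m.id
""").fetchall()
conn.close()
```

The variable number of failures per MultiEvent is handled by the number of rows in the event table, so no padding is needed.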

The issue with this solution is that I couldn't then use the large collection of tools already built on top of numpy. An extension of this answer would be to create a numpy array (or a pandas DataFrame) based on a join between my MultiEvent table and the table holding my scalar variables. However, for my purpose (using a particular set of pymc models), that is also unfortunately not an option. — jcrudy, Nov 09 '12 at 18:25
@user1572508 have you considered something like [PyTables](http://www.pytables.org/moin)? — John Lyon, Nov 11 '12 at 10:34
