There are examples for creating custom numpy dtypes using C here:
Additionally, it seems to be possible to create custom ufuncs in cython:
It seems like it should also be possible to create a dtype using cython (and then create custom ufuncs for it). Is it possible? If so, can you post an example?
USE CASE:
I want to do some survival analysis. The basic data elements are survival times (floats) with associated censor values (False if the associated time represents a failure time and True if it instead represents a censoring time (i.e., no failure occurred during the period of observation)).
Obviously I could just use two numpy arrays to store these values: a float array for the times and a bool array for the censor values. However, I want to account for the possibility of an event occurring multiple times (this is a good model for, say, heart attacks - you can have more than one). In this case, I need an array of objects which I call MultiEvent
s. Each MultiEvent
contains a sequence of floats (uncensored failure times) and an observation period (also a float). Note that the number of failures is not the same for all MultiEvent
s.
I need to be able to perform a few operations on an array of MultiEvent
s:
Get the number of failures for each
Get the censored time (that is the period of observation minus the sum of all failure times)
Calculate a log likelihood based on additional arrays of parameters (such as an array of hazard values). For example, the log likelihood for a single
MultiEvent
M
and constant hazard valueh
would be something like:sum(log(h) + h*t for t in M.times) - h*(M.period - sum(M.times))
where M.times
is the list (array, whatever) of failure times and M.period
is the total observation period. I want the proper numpy broadcasting rules to apply, so that I can do:
log_lik = logp(M_vec,h_vec)
and it will work as long as the dimensions of M_vec
and h_vec
are compatible.
My current implementation uses numpy.vectorize
. That works well enough for 1 and 2, but it is too slow for 3. Note also that I can't do this because the number of failures in my MultiData objects is not known ahead of time.