What is the fastest way to sample slices of numpy arrays?

Question

I have a 3D (time, X, Y) numpy array containing 6-hourly time series for a few years (say 5). I would like to create a sampled time series containing one instance of each calendar day, randomly taken from the available records (5 possibilities per day), as follows:

Jan 01: 2006
Jan 02: 2011
Jan 03: 2009
...

This means I need to take 4 values from 01/01/2006, 4 values from 02/01/2011, etc. I have a working version, which works as follows:

Reshape the input array to add a "year" dimension (Time, Year, X, Y)
Create an array of 365 randomly generated integers between 0 and 4 (one year index per calendar day)
Use np.repeat on that array of integers to extract only the relevant values:

Example:

sampledValues = Variable[np.arange(numberOfDays * ValuesPerDays), sampledYears.repeat(ValuesPerDays),:,:]
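Spelled out end to end, the three steps above might look like the following sketch. The grid size, the random stand-in data and the snake_case names are assumptions for illustration only. It also assumes the records are stored year after year, so a plain reshape yields (year, time, X, Y) and the first two axes have to be swapped to get the (time, year, X, Y) layout.

```python
import numpy as np

rng = np.random.default_rng(0)

n_years, n_days, per_day = 5, 365, 4   # 6-hourly data, Feb 29 already removed
nx, ny = 3, 2                          # small hypothetical grid

# Stand-in for the real data, shape (n_years * n_days * per_day, X, Y)
data = rng.random((n_years * n_days * per_day, nx, ny))

# Step 1: add a year axis. reshape gives (year, time, X, Y);
# swapaxes turns that into (time, year, X, Y) without copying.
variable = data.reshape(n_years, n_days * per_day, nx, ny).swapaxes(0, 1)

# Step 2: one random year index (0..4) per calendar day
sampled_years = rng.integers(0, n_years, size=n_days)

# Step 3: repeat each year index once per 6-hourly record of that day,
# then pair it with the running time index
time_idx = np.arange(n_days * per_day)
sampled_values = variable[time_idx, sampled_years.repeat(per_day)]

print(sampled_values.shape)
```

Only step 2 and step 3 need to run inside the sampling loop; the reshape is done once up front.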

This seems to work, but is this the best/fastest approach to solve my problem? Speed is important as I am doing this in a loop, and would benefit from testing as many cases as possible.

Am I doing this right?

Thanks

EDIT: I forgot to mention that I filtered the input dataset to remove the 29th of Feb for leap years.

Basically, the aim of that operation is to find a 365-day sample that matches the long-term time series well in terms of mean etc. If the sampled time series passes my quality test, I want to export it and start again.

eumiro · Answer 1 · 2011-10-21T12:42:40.057

3

The year 2008 was 366 days long, so don't reshape.

Have a look at scikits.timeseries:

import numpy as np
import scikits.timeseries as ts

start_date = ts.Date('H', '2006-01-01 00:00')
end_date = ts.Date('H', '2010-12-31 18:00')
arr3d = ... # your 3D array [time, X, Y]

dates = ts.date_array(start_date=start_date, end_date=end_date, freq='H')[::6]
t = ts.time_series(arr3d, dates=dates)
# just make sure arr3d.shape[0] == len(dates) !

Now you can access the t data with day/month/year objects:

t[np.logical_and(t.day == 1, t.month == 1)]

so for example:

for day_of_year in xrange(1, 366):
    year = np.random.randint(2006, 2011)

    day_data = t[np.logical_and(t.day_of_year == day_of_year, t.year == year)]
    # returns a [4, X, Y] array with data from that day

Play with the attributes of t to make it work with leap years too.
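If scikits.timeseries is not available, the same kind of day/year mask can be built with plain numpy datetime64 values. This is only a sketch of the masking idea and does not use the scikits.timeseries API; the names times, years, day_of_year and the small random arr3d are assumptions for illustration.

```python
import numpy as np

# 6-hourly timestamps from 2006-01-01 00:00 to 2010-12-31 18:00
times = np.arange(np.datetime64("2006-01-01T00"),
                  np.datetime64("2011-01-01T00"),
                  np.timedelta64(6, "h"))

# Calendar year of every timestamp (datetime64[Y] counts years since 1970)
years = times.astype("datetime64[Y]").astype(int) + 1970

# Day of year, 1-based: days elapsed since January 1 of the same year
days = times.astype("datetime64[D]")
year_start = times.astype("datetime64[Y]").astype("datetime64[D]")
day_of_year = (days - year_start).astype(int) + 1

# Stand-in for the real [time, X, Y] array
arr3d = np.random.default_rng(0).random((len(times), 3, 2))

# All four 6-hourly records of January 1st, 2008
mask = (day_of_year == 1) & (years == 2008)
print(arr3d[mask].shape)
```

Because 2008 is a leap year, day_of_year runs up to 366 for that year, so Feb 29 still needs to be handled (or filtered out) explicitly.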

edited Oct 21 '11 at 12:42

answered Oct 21 '11 at 12:11

eumiro


This looks like a promising approach! – heltonbiker Oct 21 '11 at 12:34
I should have mentioned it, but I don't really care about leap years in this case, as I already removed all the Feb 29th occurrences in the input time series. I thought of using scikits.timeseries, but I am not sure I would really benefit from it in terms of speed. In addition, I may want to start my days at 6:00 or 12:00, so I don't really want to have to create an array of datetime objects to extract every time when I could just use my sampled array (rs=np.random.randint(0, np.size(years), size=365) ) straight away. But I may be wrong! – Jahfet Oct 21 '11 at 13:29

score 0 · Answer 2 · answered Oct 21 '11 at 12:34

0

I don't see a real need to reshape the array, since you can embed the year-size information in your sampling process, and leave the array with its original shape.

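For example, you can generate a random offset (from 0 to 364), and pick the slice with index, say, n*365 + offset, where n is the year index (scaled by the number of records per day if each day holds several values).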
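A minimal sketch of that index arithmetic, assuming 4 records per day, 365-day years stored one after another, and a randomly chosen year per calendar day as in the question (the names and the random stand-in data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

n_years, n_days, per_day = 5, 365, 4
# Stand-in for the original, un-reshaped [time, X, Y] array
data = rng.random((n_years * n_days * per_day, 3, 2))

# One random year index per calendar day
years = rng.integers(0, n_years, size=n_days)
days = np.arange(n_days)

# Position of the first 6-hourly record of each chosen (year, day)
first = (years * n_days + days) * per_day

# Expand to all records of each day and flatten to one index array
flat_idx = (first[:, None] + np.arange(per_day)).ravel()

sampled = data[flat_idx]
print(sampled.shape)
```

This keeps the array in its original shape; only the integer index array is rebuilt on each iteration of the sampling loop.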

Anyway, I don't think your question is complete, because I didn't quite understand what you need to do, or why.


heltonbiker


I don't know if the reshape operation is needed or not; I just thought it would be convenient, as I can very easily select which year I want to extract for each day. I only have to do it once before entering my sampling loop, so I thought that would have no impact on performance. I added a few details to the question; hopefully you will understand better what I am after. – Jahfet Oct 21 '11 at 13:42
