0

I have a numpy array roughly like so:

data    
array([(datetime.datetime(2009, 1, 6, 2, 30), 17924.0, 0.0),....
           (datetime.datetime(2009, 1, 29, 16, 30), 35249.2, 521.25], 
         dtype=[('timestamp', '|O4'), ('x1', '<f8'), ('x2', '<f8')])

I would like to be able to index the data based on the first column (i.e. with the datetime objects), so I can access a particular year / month / day worth of data, with something like this:

data[data['timestamp'].year == 2009]

This obviously doesn't work. The only thing I can think of doing is adding additional columns (e.g. a "year" column), so this would work:

data[data['year'] == 2009]

Seems like a fairly inefficient way of doing things (and would duplicate a lot of data) - particularly if I want to index on all the other time intervals as well... is there a better way to do this?

Thanks in advance.

djmac
  • 827
  • 5
  • 11
  • 27

3 Answers3

3

Use pandas. "pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."

There are tons of examples in the documentation but you can do what you are looking to do like this:

import pandas
import numpy as np
import datetime as dt

# example values
dates = np.asarray(pandas.date_range('1/1/2000', periods=8))

# create a dataframe
df = pandas.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# date you want
date=dt.datetime(2000,1,2)

# magic :)
print df.xs(date)

I suggest learning this module ASAP. It is absolutely exceptional. This is a very simple example. Check out the documentation which is very thorough.

Jason Strimpel
  • 14,670
  • 21
  • 76
  • 106
  • Thanks! Much appreciated - this looks very useful . . . and maybe I can stop using R – djmac Sep 05 '12 at 00:56
  • Sorry actually, I am not sure if this helps my original problem much. I don't want to be able to select a particular timestamp - I could already do that the numpy arrays. What I really need to be able to do is select a particular time period, and access all the data from that time period (not a single index). Eg. select data from month of the year or similar. Perhaps it is not possible without creating redundant columns of data. – djmac Sep 05 '12 at 03:23
1

Okay so I think I solved the problem (using pandas, as suggested by strimp099 above), specifically, using the "GroupBy" object (pandas: Group By: split-apply-combine)

To elaborate on the example used above:

import pandas
import numpy as np
import datetime as dt

# example values
dates = np.asarray(pandas.DateRange('1/1/2000', periods=200))

# create a dataframe
df = pandas.DataFrame(np.random.randn(200, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# create a GroupBy object
grouped_data = df.groupby(lambda x: x.month)

#magic
grouped_data.mean()
              A         B         C         D
month                                        
1     -0.492648 -0.038257 -0.224924  0.130182
2     -0.178995  0.236042 -0.471791 -0.369913
3     -0.261866 -0.024680 -0.107211 -0.195742
4      0.215505  0.077079 -0.057511  0.146193
5     -0.097043 -0.335736  0.302811  0.120170
6      0.187583  0.221954 -0.290655 -0.077800
7     -0.134988  0.013719 -0.094334 -0.107402
8     -0.229138  0.056588 -0.156174 -0.067655
9      0.043746  0.077781  0.230035  0.344440
10    -0.533137 -0.683788  0.395286 -0.957894

(i.e. averages of the data grouped by month)

Also, to do multiple groupings, (i.e. in my case year and month), this might help:

grouped_data = df.groupby(lambda x: (x.year,x.month))

Cheers!

djmac
  • 827
  • 5
  • 11
  • 27
  • Well done. I just wanted to provide a simple example to whet your appetite. I'm sure if you dig into the documentation deep enough, you will be able to solve your issue with the built in functions. – Jason Strimpel Sep 05 '12 at 14:51
0

You can also make use of the datetime dtype in numpy. I haven't benchmarked the two approaches but they might be pretty close. Here is an example:

import datetime
import numpy as np


def data_in(dates, year=2009):
    """ Return the dates within the given year. 
    Works only with dates being a numpy array with a datetime dtype.
    """

    from_date = np.array(('{}-01-01'.format(year), ), dtype='M8')
    to_date = np.array(('{}-12-31'.format(year),), dtype='M8')

    return dates[(dates > from_date) & (dates < to_date)]


if __name__ == '__main__':

    data = np.array(
        [
            (datetime.datetime(2009, 1, 6, 2, 30), 17924.0, 0.0),
            (datetime.datetime(2009, 1, 29, 16, 30), 35249.2, 521.25),
            (datetime.datetime(2011, 1, 29, 16, 30), 35249.2, 521.25),
        ], 
        dtype=[('timestamp', 'M8'), ('x1', '<f8'), ('x2', '<f8')]
    )

    for year in [2009, 2010, 2011]:
        print ' Timestamps in {}:\n {}'.format( year, data_in(data['timestamp'], year))
dpinte
  • 432
  • 2
  • 4