17

I have a Pandas dataframe which is indexed by a DatetimeIndex:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
Data columns:
Date(dd-mm-yy)_Time(hh-mm-ss)       53732  non-null values
Julian_Day                          53732  non-null values
AOT_870                             53732  non-null values
440-870Angstrom                     53732  non-null values
440-675Angstrom                     53732  non-null values
500-870Angstrom                     53732  non-null values
Last_Processing_Date(dd/mm/yyyy)    53732  non-null values
Solar_Zenith_Angle                  53732  non-null values
time                                53732  non-null values
dtypes: datetime64[ns](2), float64(6), object(1)

I want to find the row that is closest to a certain time:

image_time = dateutil.parser.parse('2009-07-28 13:39:02')

and find how close it is. So far, I have tried various things based upon the idea of subtracting the time I want from all of the times and finding the smallest absolute value, but none quite seem to work.

For example:

aeronet.index - image_time

Gives an error which I think is due to +/- on a Datetime index shifting things, so I tried putting the index into another column and then working on that:

aeronet['time'] = aeronet.index
aeronet.time - image_time

This seems to work, but to do what I want, I need to get the ABSOLUTE time difference, not the relative difference. However, just running abs or np.abs on it gives an error:

abs(aeronet.time - image_time)

C:\Python27\lib\site-packages\pandas\core\series.pyc in __repr__(self)
   1061         Yields Bytestring in Py2, Unicode String in py3.
   1062         """
-> 1063         return str(self)
   1064 
   1065     def _tidy_repr(self, max_vals=20):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __str__(self)
   1021         if py3compat.PY3:
   1022             return self.__unicode__()
-> 1023         return self.__bytes__()
   1024 
   1025     def __bytes__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __bytes__(self)
   1031         """
   1032         encoding = com.get_option("display.encoding")
-> 1033         return self.__unicode__().encode(encoding, 'replace')
   1034 
   1035     def __unicode__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __unicode__(self)
   1044                     else get_option("display.max_rows"))
   1045         if len(self.index) > (max_rows or 1000):
-> 1046             result = self._tidy_repr(min(30, max_rows - 4))
   1047         elif len(self.index) > 0:
   1048             result = self._get_repr(print_header=True,

C:\Python27\lib\site-packages\pandas\core\series.pyc in _tidy_repr(self, max_vals)
   1069         """
   1070         num = max_vals // 2
-> 1071         head = self[:num]._get_repr(print_header=True, length=False,
   1072                                     name=False)
   1073         tail = self[-(max_vals - num):]._get_repr(print_header=False,

AttributeError: 'numpy.ndarray' object has no attribute '_get_repr'

Am I approaching this the right way? If so, how should I get abs to work, so that I can then select the minimum absolute time difference, and thus get the closest time. If not, what is the best way to do this with a Pandas time-series?

DSM
  • 342,061
  • 65
  • 592
  • 494
robintw
  • 27,571
  • 51
  • 138
  • 205

3 Answers3

20

This simple method will return the (integer index of the) TimeSeriesIndex entry closest to a given datetime object. There's no need to copy the index to a regular column - simply use the .to_pydatetime method instead.

import numpy as np

i = np.argmin(np.abs(df.index.to_pydatetime() - image_time))

Then you simply use the DataFrame's .iloc indexer:

df.iloc[i]

Here's a function to do this:

def fcl(df, dtObj):
    return df.iloc[np.argmin(np.abs(df.index.to_pydatetime() - dtObj))]

You can then further filter seamlessly, e.g.

fcl(df, dtObj)['column']
cmeeren
  • 3,890
  • 2
  • 20
  • 50
  • seems like there should be an easier way, but this was a nice solution that helped me keep going! – mishaF Jun 17 '15 at 15:48
8

I think you can try DatetimeIndex.asof to find the most recent label up to and including the input. Then use the returned datetime to select the appropriate row. If you only need values for a particular column, Series.asof exists and combines the two steps above into one.

This assumes you want the closest datetime. If you don't care about the date and just want the same time every day, use at_time in DataFrame.

Follow up:

Edit: false alarm, I had an older version locally. The latest on master should work with np.abs.

In [10]: np.abs(df.time - image_time)
Out[10]: 
0    27 days, 13:39:02
1    26 days, 13:39:02
2    25 days, 13:39:02
3    24 days, 13:39:02
4    23 days, 13:39:02
5    22 days, 13:39:02

Also just to clarify:

aeronet.index - image_time doesn't work because subtraction on Index is a set difference (back in the day Index used to be constrained to be unique).

Community
  • 1
  • 1
Chang She
  • 16,692
  • 8
  • 40
  • 25
  • 1
    Thanks - that helps. However, I don't think it'll do what I want all of the time. For example, if I have data at 10:25, 10:32 and call `asof` with 10:30, it'll return 10:25, rather than 10:32 which is actually closer. That is, what I want is the closest time, not necessarily the closest time *before* the time I have given. Maybe I will need to do it using my manual method, but I can't see why `np.abs` won't work. – robintw Feb 27 '13 at 16:22
  • Thanks for the follow-up. I installed the latest development version from http://pandas.pydata.org/pandas-build/dev/ earlier today (about 3pm UTC) - has it changed since then to allow `np.abs`? (I will test with the very latest dev version tomorrow at work) – robintw Feb 27 '13 at 22:50
  • I don't think so. Also looking at the error, it's not in the computation but in the output of the results. If you assign the results to a variable you might be able to poke around and figure out what the issue is – Chang She Feb 28 '13 at 00:48
  • I tried assigning to a variable, but got exactly the same results. Seems rather strange. Should I raise a bug with the pandas team? – robintw Feb 28 '13 at 07:35
  • Yes please do make an issue on github. When you do, can you please provide clearer details, i.e., can you cut your dataset down to a small size and put up a self-contained reproducible example? That will help us debug the issue. Thanks – Chang She Feb 28 '13 at 16:23
5

I was confronting the same problem today. I wanted a function able to give me the closest value prior a given timestamp. Here is the function I got:

def get_nearest_past(data, timestamp):
    index = data.index.get_loc(timestamp,"ffill")
    return data.iloc[index]

In the case that you need the global closest (and not the closest before as in my case), you can use:

def get_nearest(data, timestamp):
    index = data.index.get_loc(timestamp,"nearest")
    return data.iloc[index]

You can find more information in the get_loc documentation.

silgon
  • 6,890
  • 7
  • 46
  • 67