
I have a DataFrame in Python 3.5:

In [1]: tway5new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 44 entries,  to VOI
Columns: 43802 entries, 2011-01-01 00:00:00 to 2015-12-31 23:00:00
dtypes: int64(43802)
memory usage: 14.7+ MB

And the column index of this DataFrame is:

In [2]: tway5new.columns
Out[2]: 
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
           '2011-01-01 02:00:00', '2011-01-01 03:00:00',
           ...
           '2015-12-31 20:00:00', '2015-12-31 21:00:00',
           '2015-12-31 22:00:00', '2015-12-31 23:00:00'],
          dtype='datetime64[ns]', name='timenew', length=43802, freq=None) 

I want to subset this DataFrame into a sequence of relatively small DataFrames, each containing the records of one typical day: the first small DataFrame would contain the information from '2011-01-01 00:00:00' to '2011-01-01 23:00:00', the second from '2011-01-02 00:00:00' to '2011-01-02 23:00:00', and so on, until the 1826th small DataFrame containing the information from '2015-12-31 00:00:00' to '2015-12-31 23:00:00'.
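For illustration, one way I imagine such a daily split could be done is to transpose first, so the timestamps become the row index, and then group on the calendar date (a rough sketch only; the variable names here are purely illustrative):

import pandas as pd

# transpose so the DatetimeIndex becomes the row index
df = tway5new.T

# group by calendar date: each group is one "typical day" sub-frame
# of 24 hourly rows by 44 columns, 1826 groups in total
daily_frames = {date: day_df for date, day_df in df.groupby(df.index.date)}

# e.g. the sub-frame for the first day
first_day = daily_frames[pd.Timestamp('2011-01-01').date()]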

Following @EdChum's help, I tried the following code:

df = tway5new.T                       # transpose so the DatetimeIndex becomes the row index
df.resample('d')                      # note: on its own this call has no effect; the groupby below does the daily split
demean = lambda x: abs(x - x.mean())
# absolute difference between each hourly record and that day's mean, for each column
Asub = df.groupby([df.index.year, df.index.month, df.index.day]).transform(demean)
Asubmax = Asub.groupby([df.index.year, df.index.month, df.index.day]).max()
AsubmaxID = Asubmax.idxmax(axis=1)

This gave the column label of the maximum value for each typical day, i.e. from 2011-01-01 to 2015-12-31:

with pd.option_context('display.max_rows',10,'display.max_columns',6):
    print (AsubmaxID)
2011  1   1     UNF
          2     NAT
          3     NAT
          4     NAT
          5     NAT

2015  12  27    NAT
          28    NAT
          29    NAT
          30    NAT
          31    GOA
dtype: object


BmaxID=Asub.groupby([df.index.year, df.index.month, df.index.day]).idxmax(axis=1)

This gave the column label of the maximum value for each hourly record, i.e. from 2011-01-01 00:00:00 to 2015-12-31 23:00:00:

with pd.option_context('display.max_rows',10,'display.max_columns',6):
    print (BmaxID)
          timenew            
2011  1   1   2011-01-01 00:00:00    UNF
              2011-01-01 01:00:00    NAT
              2011-01-01 02:00:00    RTF
              2011-01-01 03:00:00    UNF
              2011-01-01 04:00:00    NAT

2015  12  31  2015-12-31 19:00:00    NAT
              2015-12-31 20:00:00    NAT
              2015-12-31 21:00:00    GOA
              2015-12-31 22:00:00    NAT
              2015-12-31 23:00:00    GOA
dtype: object

Now, how can I find the exact hour record of the maximum value for each typical day? For example, I know that the column holding the overall maximum value for the first date, 2011-01-01, is UNF, but how can I obtain the exact hour record for this UNF column on 2011-01-01, i.e. which hour record (either 2011-01-01 00:00:00 or 2011-01-01 03:00:00) holds the maximum value?
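For concreteness, here is the direction I have been considering (only a rough, untested sketch, assuming Asub is built exactly as above): stack each day's 24 x 44 sub-frame into a Series indexed by (timestamp, column) and take its idxmax, which should return the (hour, column) pair of that day's overall maximum.

daily = Asub.groupby([Asub.index.year, Asub.index.month, Asub.index.day])

# for each day, flatten the 24 x 44 sub-frame into a Series indexed by
# (timestamp, column) and return the label of its largest value
peak_per_day = daily.apply(lambda g: g.stack().idxmax())

Is something along these lines the right approach?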

Thanks a lot

TWord

1 Answer


IIUC you'd be better off making the columns the index and then you can resample or filter on the day:

df = tway5new.T

then you can downsample:

df.resample('d')

or group on the day:

df.groupby([df.index.year, df.index.month, df.index.day])
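Either call can then be finished with an aggregation; for example (a minimal sketch; note that in recent pandas versions resample returns a Resampler object, so an aggregation such as mean is needed to materialise the result):

# daily means via resample
daily_means = df.resample('d').mean()

# the same daily means via the groupby,
# indexed by (year, month, day) instead of by date
daily_means_gb = df.groupby([df.index.year, df.index.month, df.index.day]).mean()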
EdChum
  • Thanks @EdChum. As you suggested, I found another solution to do it: `y1=tway5new.filter(like='2011-01-01')` – TWord May 25 '16 at 15:23
  • That doesn't sound very dynamic though as it requires you to pass the filter string each time – EdChum May 25 '16 at 15:25
  • After obtaining each small subset of data, I would compute the row mean for that small data frame. So I think I would create a loop with an iteration i running from 2011-01-01 to 2015-12-31 – TWord May 25 '16 at 15:30
  • You could achieve the same thing by doing `df.groupby([df.index.year, df.index.month, df.index.day]).mean()` – EdChum May 25 '16 at 15:33
  • Great. I tried your suggestion and obtained a 1826-row by 44-column data frame whose elements are the averages. However, within a typical day, I need to subtract the corresponding average from the repetition of a specific categorical variable (i.e. the hour record of each column in df) to obtain the difference, then find out which column has the largest absolute difference, and finally divide the original repetition of that column at that hour record by the total repetition of the typical day. These fractions for each typical day are my final outcome. – TWord May 25 '16 at 15:50
  • You should really define your question properly but I think that either `groupby` or `resample` should do what you want – EdChum May 25 '16 at 15:51
  • Therefore, I think generating each small subset data frame one by one might make these calculations easier to carry out. Your suggestion is quite useful for obtaining the average for each subset data frame – TWord May 25 '16 at 15:53