5

I would like to get a count for the # of the previous 5 values in df['A'] which are < current value in df['A'] & are also >= df2['A']. I am trying to avoid looping over every row and columns because I'd like to apply this to a larger data set.

Given this...

list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1-.05))

I would like to return this (solved in Excel with COUNTIFS)...

enter image description here

The line below achieves the first part (thanks Alexander), and Divakar and DSM have also weighed in previously (here and here).

df3 = pd.DataFrame(df.rolling(center=False,window=6).apply(lambda rollwin: sum((rollwin[:-1] < rollwin[-1]))))

But I am unable to to add the comparison to df2. Please help.

FOLLOW UP on 10/27/16:

How would I write the lambda above as a standard function?

10/28/16:

See below, taking col 'A' from both df and df2, I am trying to count how many of the previous 5 values from df['A'] fall between the current df2['A'] and df['A']. Said differently, how many from each orange box fall between the yellow low-high range?

enter image description here

UPDATE: different list1 data produces incorrect df3...

list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[26,108],[25,102],[26,106],[25,111],[22,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1-.05))

df3 = pd.DataFrame(
     df.rolling(center=False,window=6).apply(
          lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95,rollwin[-1]).sum()))

df
Out[9]: 
             A    B
2000-01-01  21  101
2000-01-02  22  110
2000-01-03  25  113
2000-01-04  24  112
2000-01-05  21  109
2000-01-06  26  108
2000-01-07  25  102
2000-01-08  26  106
2000-01-09  25  111
2000-01-10  22  110


df3
Out[8]: 
              A    B
2000-01-01  NaN  NaN
2000-01-02  NaN  NaN
2000-01-03  NaN  NaN
2000-01-04  NaN  NaN
2000-01-05  NaN  NaN
2000-01-06  1.0  0.0
2000-01-07  2.0  0.0
2000-01-08  3.0  1.0
2000-01-09  2.0  3.0
2000-01-10  1.0  3.0

EXCEL EXAMPLES (11/14): see below, trying to count how many numbers in the blue box fall between the range highlighted in orange.

enter image description here

Community
  • 1
  • 1
MJS
  • 1,573
  • 3
  • 17
  • 26
  • 2
    Your `df2` is not reproducible. `df_data` is not defined here. – Psidom Oct 26 '16 at 20:57
  • fixed Psidom. Thanks. – MJS Oct 26 '16 at 20:58
  • Will we tidy this question and answer up? – Dickster Nov 25 '16 at 21:04
  • Sorry is there something I did not do? I upvoted both of your answers and selected the second as it's more concise for others to look at. Great answers, very much appreciate your time. – MJS Nov 28 '16 at 18:31
  • On a separate note, I don't find the docs on manipulating df.rolling() objects to be as fleshed out as many other topics. It would be useful to replace the 0.95 below with another derivation of the same rolling data. – MJS Nov 28 '16 at 18:44

2 Answers2

2
list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1-.05))


window = 6
results = []
for i in range (len(df)-window+1):
    slice_df1 = df.iloc[i:i + window]
    slice_df2 = df2.iloc[i:i + window]
    compare1 = slice_df1['A'].iloc[-1]
    compare2 = slice_df2['A'].iloc[-1]
    a= slice_df1.iloc[:-1]['A'].between(compare2,compare1)  # series have a between metho
    results.append(a.sum())

df_res =  pd.DataFrame(data = results , index = df.index[window-1:] , columns = ['countifs'])
df_res = df_res.reindex(df.index,fill_value=0.0)
print df_res

which yields:

            countifs
2000-01-01    0.0000
2000-01-02    0.0000
2000-01-03    0.0000
2000-01-04    0.0000
2000-01-05    0.0000
2000-01-06    0.0000
2000-01-07    0.0000
2000-01-08    1.0000
2000-01-09    1.0000
2000-01-10    0.0000

BUT

Seeing there is a logical relationship between your upper and lower bound, value and value - 5%. Then this will perhaps be what you want.

    df3 = pd.DataFrame(
         df.rolling(center=False,window=6).apply(
            lambda rollwin: sum(np.logical_and(
                                    rollwin[-1]*0.95 <= rollwin[:-1]
                                   ,rollwin[:-1] < rollwin[-1]) 
                                )))

and if you prefer the pd.Series.between() approach:

df3 = pd.DataFrame(
     df.rolling(center=False,window=6).apply(
          lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95,rollwin[-1]).sum()))
Dickster
  • 2,969
  • 3
  • 23
  • 29
  • thanks Dickster, really appreciate your input. While I am working through your solution, I added some color to my question above - probably didn't ask clearly enough. – MJS Oct 28 '16 at 20:56
  • works perfectly, thanks again. i believe the .between() approach outperforms the first method. i am going to ask a follow up as a new question. – MJS Oct 31 '16 at 14:22
  • cool point me at that - and credit on features shown by the others on the thread that are implemented. – Dickster Nov 03 '16 at 14:38
  • hi Dickster - after further use, something appears to be off. When I replace the original 'list1' with list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[26,108],[25,102],[26,106],[25,111],[22,110]], the output with the Series.between() solution does not appear correct. Any chance you can take a look? – MJS Nov 08 '16 at 17:48
  • I don't see the issue using the new data. What is it that you think is wrong? – Dickster Nov 09 '16 at 21:38
  • I posted the new data in the question above. df3.ix['2000-01-06','A'] should not = 1.0 because 26 is not between any of the 5 prior low:high pairs (x*0.95:x). – MJS Nov 09 '16 at 22:08
  • I think there may be a disconnect in my solution and how I have perceived the question and how you perceive the solution should be working. I wonder can you make an excel spreadsheet available with a couple of examples and your expected behaviour? – Dickster Nov 13 '16 at 23:31
  • i took a snapshot of a spreadsheet with 3 examples and pasted above. if there is a way to attach the sheet itself with formulas, i would be happy to do but don't know how. thanks again for looking. – MJS Nov 14 '16 at 22:01
  • see new answer. I recognize your list in the comments (2 columns) is different from the new screen shots. – Dickster Nov 16 '16 at 22:40
2
list1 = [[21,50,101],[22,52,110],[25,49,113],[24,49,112],[21,55,109],[28,54,108],[30,57,102],[26,56,106],[25,58,111],[24,60,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('ABC'))

print df

I believe this matches your new screen shot "Given Data".

             A   B    C
2000-01-01  21  50  101
2000-01-02  22  52  110
2000-01-03  25  49  113
2000-01-04  24  49  112
2000-01-05  21  55  109
2000-01-06  28  54  108
2000-01-07  30  57  102
2000-01-08  26  56  106
2000-01-09  25  58  111
2000-01-10  24  60  110

enter image description here

and the same function:

print pd.DataFrame(
           df.rolling(center=False,window=6).
              apply(lambda rollwin: pd.Series(rollwin[:-1]).
                   between(rollwin[-1]*0.95,rollwin[-1]).sum()))

gives your desired output "Desired outcome":

             A   B   C
2000-01-01 nan nan nan
2000-01-02 nan nan nan
2000-01-03 nan nan nan
2000-01-04 nan nan nan
2000-01-05 nan nan nan
2000-01-06   0   1   0
2000-01-07   0   1   0
2000-01-08   1   2   1
2000-01-09   1   2   3
2000-01-10   0   2   3

enter image description here

Dickster
  • 2,969
  • 3
  • 23
  • 29
  • it works. thank you so much. toggling between inclusive=True/False is important. – MJS Nov 18 '16 at 19:53