Grouping pandas dataframe into groups based on a repeating sequence of values in one column

Question

I have 2 columns in my dataframe: x and y. x is continually repeating between 1-4 and I need to find out some statistics about the sections where x=2, e.g. mid-point and average etc. I have created a third column using .shift(-1):

 df['x_shift']=df['x'].shift(-1)

x  y   x_shift 
1  25  1
1  25  1
1  25  2
2  25  2
2  45  2
2  90  3
3  90  3
3  90  3
3  90  4
4  60  4
4  40  1 
1  25  1
1  25  1
1  25  2
2  43  2
2  66  2
2  77  2
2  90  3
3  90

Using this, I have identified the points where x changes from 1 to 2 and from 2 to 3 to mark the start and end of where I need the data:

   start point =     df.ix[(df['x']==2) & (df['x_shift']==2)] 
   final point = df.ix[df['x']==2) & (df['x_shift']==3)]

I have tried to create groups to generate statistics, but I wasn't sure how to include the above within groupby:

 grouped = df.groupby(     )

The intention is to use grouped.describe() to generate statistics, which I am hoping that I will be able to also extract and plot?

I'm now at a stage where I am using the groupby function of pandas, and I want to group each section of 2222 within the 11122223333444411112222333444 continuous series to then use the grouped.describe() to give me statistical information for each of the sections of 2222. Does anyone have any ideas on how I can group all these separated sections of 2222 together? — AM94, Aug 01 '17 at 15:05

score 0 · Answer 1 · answered Jul 27 '17 at 16:59

I'm not familiar with Pandas, but splitting the data shouldn't be too difficult. It doesn't look like the characters repeat regularly. To split them, I would iterate over the data, and compare the current data point to the previous. Assuming it always goes from 4 to 1, potential code would look like this:

for a in range(1,len(x)):
    if x[a] == 1 and x[a-1] == 4:
        ##Separate data depending on the format you want it in

As for finding the midpoint and gradient, I'm guessing these are Panda specific, but you can identify cases where the data in column x is equal to 2 with a for loop and if statement:

for b in x:
    if b == 2:
        ##Return midpoint, gradient in column y

Thanks for this, I can't seem to get the code for "if the value of the row is x and the row below is y" correct within the dataframe set up. — AM94, Aug 01 '17 at 08:03

piRSquared · Accepted Answer · 2017-08-01T15:52:08.057

0

Option 1
Quick approach where x == 2

df.query('x == 2').y.describe()

count     7.000000
mean     62.285714
std      25.256777
min      25.000000
25%      44.000000
50%      66.000000
75%      83.500000
max      90.000000
Name: y, dtype: float64

Option 2
More complete

df.groupby('x').y.describe()

   count       mean        std   min   25%   50%   75%   max
x                                                           
1    6.0  25.000000   0.000000  25.0  25.0  25.0  25.0  25.0
2    7.0  62.285714  25.256777  25.0  44.0  66.0  83.5  90.0
3    4.0  90.000000   0.000000  90.0  90.0  90.0  90.0  90.0
4    2.0  50.000000  14.142136  40.0  45.0  50.0  55.0  60.0

Option 3
More specific

df.groupby('x').y.agg(['mean', 'median', 'max', 'min', 'count', 'size'])

        mean  median  max  min  count  size
x                                          
1  25.000000      25   25   25      6     6
2  62.285714      66   90   25      7     7
3  90.000000      90   90   90      4     4
4  50.000000      50   60   40      2     2

Option 4
Split into the different contiguous groups

df.groupby(
    ['x', df.x.ne(df.x.shift()).cumsum().rename('grp')]
).y.describe().loc[2]

     count       mean        std   min    25%   50%    75%   max
grp                                                             
2      3.0  53.333333  33.291641  25.0  35.00  45.0  67.50  90.0
6      4.0  69.000000  19.916492  43.0  60.25  71.5  80.25  90.0

Or

df.groupby(
    ['x', df.x.ne(df.x.shift()).cumsum().rename('grp')]
).y.agg(['mean', 'median', 'max', 'min', 'count', 'size']).loc[2]

          mean  median  max  min  count  size
grp                                          
2    53.333333    45.0   90   25      3     3
6    69.000000    71.5   90   43      4     4

edited Aug 01 '17 at 15:52

answered Jul 27 '17 at 17:02

piRSquared

285,575
57
475
624

Sorry I'm not completely clear on how I can use this function when the pattern of my first column is not repeating regularly. Maybe I'm missing something.... – AM94 Aug 01 '17 at 08:05
@AM94 edit your question to include a more representative example – Paul H Aug 01 '17 at 15:17
@PaulH I've tried to update it with what stage its at currently, cheers! – AM94 Aug 01 '17 at 15:41
@AM94 I've updated my post. Hopefully that is more helpful. – piRSquared Aug 01 '17 at 15:47
@PaulH Cheers for this! I think the confusing part is that I don't want the overall average when x=2, i want it for each section of 1234 repeats independently so I don't want to group by x, I'd need to group all 1234 repeats into separate groups if that makes sense? Sorry if i'm not being very clear! – AM94 Aug 01 '17 at 15:56
@AM94 two things... one, If you look at **Option 4** that is what you'll see. two, it would be tremendously helpful if you were to calculate what you think the results should be and edit your post to reflect the format you'd like to see those results. This would remove any ambiguity and we who answer could get straight to giving you the answer. three (I know I said two), after that's done, I suspect you'll see that I've provided you what you asked for. – piRSquared Aug 01 '17 at 16:02
@piRSquared Thanks a lot - this worked perfectly! Will apply it to a larger dataset and plot it up now – AM94 Aug 01 '17 at 16:15

Grouping pandas dataframe into groups based on a repeating sequence of values in one column

2 Answers2