
I am using Python with a pandas DataFrame, which is imported from a CSV file:

               volume  temperature(c)
    time(sec)
    1000.1     10.4    26.5
    1000.2     12.5    30.2
    1000.3     13.2    40.5
    ...
    8000.1     78      50.8
    8000.2     79      51.5

I want to create a new data frame where I define a time window W (for example, 5 sec), and for every W seconds the values of each column are aggregated into one row using different calculations on that window, for example mean, std, z-score, etc. Example of the output data frame:

    time(sec)  mean_volume  mean_temperature  std_volume
    1000.1     12.0         32.4              1.4
    1005.1     12.5         30.2              1.7
    1010.1     11.7         30.1              1.5
    ...

I am familiar with df['new col'] = data['source'].rolling(W).mean(), but this is not the solution in my case. I attached an example:

    T,H,L,C,label
1000.1,23.18,27.272,426,1
1000.2,23.15,27.2675,429.5,1
1000.3,23.15,27.245,426,1
1000.4,23.15,27.2,426,1
1000.5,23.1,27.2,426,1
1000.6,23.1,27.2,419,1
1000.7,23.1,27.2,419,1
1000.8,23.1,27.2,419,1
1000.9,23.1,27.2,419,1
1001,23.075,27.175,419,1
1001.1,23.075,27.15,419,1
1001.2,23.1,27.1,419,1
1001.3,23.1,27.16666667,419,1
1001.4,23.05,27.15,419,1
1001.5,23,27.125,419,1
1001.6,23,27.125,418.5,1
1001.7,23,27.2,0,0
1001.8,22.945,27.29,0,0
1001.9,22.945,27.39,0,0
1002,22.89,27.39,0,0
1002.1,22.89,27.39,0,0
1002.2,22.89,27.39,0,0
1002.3,22.89,27.445,0,0

For the above example, I want the new data frame to contain the following columns: H_mean, H_std, L_mean, C_mean, L_std, C_std.

In addition, how can I apply a custom function (for example, z-score) to each segment?

Thanks,

MAK

1 Answer


Given that your data is in a pd.DataFrame called df, the following should do the trick:

    import pandas as pd
    import numpy as np

    step = 5
    # Note: np.arange excludes the stop value, so rows beyond the last
    # complete bin edge fall outside the intervals and are dropped.
    bins = np.arange(start=df.index.min(), stop=df.index.max(), step=step, dtype=float)
    df.groupby(pd.cut(df.index, bins))\
      .agg({'volume': ['mean', 'std'], 'temperature': ['mean']})

We are using pd.cut to bin the index into intervals, which we can then group by. Finally we use pd.DataFrame.agg to calculate summary statistics for each group: mean and std for the volume column, and just mean for the temperature column.

I haven't tested this but if you provide a minimal, complete and verifiable example, I can do that.

EDIT

Given the updated data, I have written the code below:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from io import StringIO

In [4]: s = """T,H,L,C,label
   ...: 1000.1,23.18,27.272,426,1
   ...: 1000.2,23.15,27.2675,429.5,1
   ...: 1000.3,23.15,27.245,426,1
   ...: 1000.4,23.15,27.2,426,1
   ...: 1000.5,23.1,27.2,426,1
   ...: 1000.6,23.1,27.2,419,1
   ...: 1000.7,23.1,27.2,419,1
   ...: 1000.8,23.1,27.2,419,1
   ...: 1000.9,23.1,27.2,419,1
   ...: 1001,23.075,27.175,419,1
   ...: 1001.1,23.075,27.15,419,1
   ...: 1001.2,23.1,27.1,419,1
   ...: 1001.3,23.1,27.16666667,419,1
   ...: 1001.4,23.05,27.15,419,1
   ...: 1001.5,23,27.125,419,1
   ...: 1001.6,23,27.125,418.5,1
   ...: 1001.7,23,27.2,0,0
   ...: 1001.8,22.945,27.29,0,0
   ...: 1001.9,22.945,27.39,0,0
   ...: 1002,22.89,27.39,0,0
   ...: 1002.1,22.89,27.39,0,0
   ...: 1002.2,22.89,27.39,0,0
   ...: 1002.3,22.89,27.445,0,0"""

In [5]: df = pd.read_csv(StringIO(s), index_col='T')

Again we use an IntervalIndex and groupby, as well as agg to calculate summary statistics.

In [6]: step = 0.5
    ...: bins = np.arange(start=df.index.min(), stop=df.index.max(), step=step, dtype=float)
    ...: grouped = df.groupby(pd.cut(df.index, bins))

In [7]: grouped.agg({'H':['mean', 'std'], 'L':['mean', 'std'], 'C':['mean', 'std']})
Out[7]: 
                       H                    L                C          
                    mean       std       mean       std   mean       std
(1000.1, 1000.6]  23.130  0.027386  27.222500  0.031820  425.3  3.834058
(1000.6, 1001.1]  23.090  0.013693  27.185000  0.022361  419.0  0.000000
(1001.1, 1001.6]  23.050  0.050000  27.133333  0.025685  418.9  0.223607
(1001.6, 1002.1]  22.934  0.046016  27.332000  0.085557    0.0  0.000000

This does not give you the column names you wanted, so let's flatten the column MultiIndex to adjust those.

In [8]: aggregated = grouped.agg({'H':['mean', 'std'], 'L':['mean', 'std'], 'C':['mean', 'std']})

In [9]: ['_'.join(col).strip() for col in aggregated.columns.values]
Out[9]: ['H_mean', 'H_std', 'L_mean', 'L_std', 'C_mean', 'C_std']

In [10]: aggregated.columns = ['_'.join(col).strip() for col in aggregated.columns.values]

In [11]: aggregated
Out[11]: 
                  H_mean     H_std     L_mean     L_std  C_mean     C_std
(1000.1, 1000.6]  23.130  0.027386  27.222500  0.031820   425.3  3.834058
(1000.6, 1001.1]  23.090  0.013693  27.185000  0.022361   419.0  0.000000
(1001.1, 1001.6]  23.050  0.050000  27.133333  0.025685   418.9  0.223607
(1001.6, 1002.1]  22.934  0.046016  27.332000  0.085557     0.0  0.000000
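The question's desired output indexes rows by the window start time rather than by interval. If that matters, each interval label can be replaced by its left edge; a minimal sketch below uses a toy stand-in for the aggregated frame (the values are illustrative, not the real aggregate):

```python
import pandas as pd

# Toy stand-in for the aggregated frame above; values are illustrative.
aggregated = pd.DataFrame(
    {'H_mean': [23.130, 23.090]},
    index=pd.IntervalIndex.from_tuples([(1000.1, 1000.6), (1000.6, 1001.1)]),
)

# Replace each interval label with its left edge, matching the
# time(sec) style of the desired output.
aggregated.index = [interval.left for interval in aggregated.index]
print(aggregated.index.tolist())  # [1000.1, 1000.6]
```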

It's not quite clear to me what you mean by applying z-scores: a z-score is not a summary statistic like std or mean, so it doesn't play well with agg. If you just want to apply z-scores to your entire DataFrame by column, you might want to take a look at this question: Pandas - Compute z-score for all columns
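If instead you want a per-segment z-score (each value standardized against its own window's mean and std), groupby(...).transform is a better fit than agg, since it broadcasts the window statistics back to the original rows. A sketch with toy data (the index and column values are illustrative, not the question's data):

```python
import pandas as pd
import numpy as np

# Toy data: the index plays the role of T; H values are illustrative.
df = pd.DataFrame({'H': [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]},
                  index=[1000.1, 1000.2, 1000.3, 1000.6, 1000.7, 1000.8])

step = 0.5
bins = pd.cut(df.index, np.arange(1000.0, 1001.5, step))

# transform computes each window's mean/std and broadcasts them back,
# so every row gets a z-score relative to its own window.
df['H_zscore'] = df.groupby(bins)['H'].transform(lambda x: (x - x.mean()) / x.std())
```

With two windows of three rows each, both groups standardize to [-1, 0, 1].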

tobsecret
  • Hi, I updated my question according to your request. Your solution gives only a partial answer to my question. Thanks – MAK Jul 20 '18 at 05:29
  • added edits, lemme know if I understood you correctly. Your edits to your question were good, you got the complete and minimal part of the MCVE but the V which stands for verifiable is missing. If you add what your expected output is, I can help you further. – tobsecret Jul 20 '18 at 15:44