1

I took a whole afternoon trying to implement this task but failed ,I've got a pandas data frame like this

columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
0:'3M' '2345' '2345' '2014-10-5',3000
1:'3M' '2958' '2152' '2015-3-22',5000
2:'GE' '2183' '2183' '2012-12-31',515
3:'3M' '2958' '2958' '2015-3-10',395
4:'GE' '2183' '2285' '2015-4-19',1925
5:'GE' '2598' '2598' '2015-3-17',1915

What is to be implemented is a new data frame grouped by "ka and kb_1" below

columns=[ka,kb,errorNum,errorRate,totalNum of records]
'3M','2345',0,0%,1
'3M','2958',1,50%,2
'GE','2183',1,50%,2
'GE','2598',0,0%,1

(definition of error Record: when kb_1!=kb_2,the corresponding record is treated as abnormal record)

My code is like this

df['isError'] = (df['kb_1'] != df['kb_2']).astype('int')
grouped2 = df.groupby(['ka', 'kb_1'])

df_rst = pd.DataFrame()
df_rst['ka']  =grouped2['ka'].all()
df_rst['kb_1'] = grouped2['kb_1'].all()
df_rst['errorNum'] = grouped2['isError'].transform(sum)
df_rst['totalNum of records'] = grouped2.size()
df_rst['Soll_neq_Letzt_error_rate'] = df_rst['errorNum'].astype('float').div(df_rst['totalNum'].astype('float'), axis='index')
df_rst.to_csv('rst.csv',index=False)

but the result is not what I wanted.

For instance, the column kb_1 becomes true/false, and errorNum becomes Nan. Can anyone explain why and give an workable implementation? Thanks

sunxd
  • 743
  • 1
  • 9
  • 24
  • Please show sample data and desired results. Saying the results are not what you wanted doesn't really tell us what you do want. – JohnE May 14 '15 at 17:14
  • Thanks, I have just added the input and desired output – sunxd May 14 '15 at 17:45

1 Answers1

1

I'm not sure exactly what you did, but I don't think you were that far off.

df2 = df.groupby(['ka','kb_1'])['isError'].agg({ 'errorNum':  'sum',
                                                 'recordNum': 'count' })

df2['errorRate'] = df2['errorNum'] / df2['recordNum']

         recordNum  errorNum  errorRate
ka kb_1                                
3M 2345          1         0        0.0
   2958          2         1        0.5
GE 2183          2         1        0.5
   2598          1         0        0.0
JohnE
  • 29,156
  • 8
  • 79
  • 109
  • wow,thanks,can't image it just takes two lines to complete, as I was still thinking of joining the table. – sunxd May 14 '15 at 19:13
  • Glad that helps, please remember to click the checkmark if you are reasonably happy with the answer. – JohnE May 14 '15 at 21:13
  • done, I met another difficulty in achieving this because the csv I try to manipulate too big to put into memory, see http://stackoverflow.com/questions/30245640/pandas-read-csv-out-of-memory – sunxd May 15 '15 at 07:21