I try to manipulate a large CSV file using Pandas, when I wrote this
df = pd.read_csv(strFileName,sep='\t',delimiter='\t')
it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory" wc -l indicate there are 13822117 lines, I need to aggregate on this csv file data frame, is there a way to handle this other then split the csv into several files and write codes to merge the results? Any suggestions on how to do that? Thanks
The input is like this:
columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
0:'3M' '2345' '2345' '2014-10-5',3000
1:'3M' '2958' '2152' '2015-3-22',5000
2:'GE' '2183' '2183' '2012-12-31',515
3:'3M' '2958' '2958' '2015-3-10',395
4:'GE' '2183' '2285' '2015-4-19',1925
5:'GE' '2598' '2598' '2015-3-17',1915
And the desired output is like this:
columns=[ka,kb,errorNum,errorRate,totalNum of records]
'3M','2345',0,0%,1
'3M','2958',1,50%,2
'GE','2183',1,50%,2
'GE','2598',0,0%,1
if the data set is small, the below code could be used as provided by another
df2 = df.groupby(['ka','kb_1'])['isError'].agg({ 'errorNum': 'sum',
'recordNum': 'count' })
df2['errorRate'] = df2['errorNum'] / df2['recordNum']
ka kb_1 recordNum errorNum errorRate
3M 2345 1 0 0.0
2958 2 1 0.5
GE 2183 2 1 0.5
2598 1 0 0.0
(definition of error Record: when kb_1!=kb_2,the corresponding record is treated as abnormal record)