1

My data looks like this:

1516268134      49.95   99.982  49.95   0       0       0       0       0       0       0
1516268134      49.95   99.966  49.95   0       0       0       0       0       0       0
1516268134      49.95   100.28  49.95   0       0       0       0       0       0       0
1516268134      49.95   100.01  49.95   0       0       0       0       0       0       0
1516268134      49.95   100.10  49.95   0       0       0       0       0       0       0
1516268134      49.95   99.773  49.95   0       0       0       0       0       0       0
1516268134      49.95   99.246  49.95   0       0       0       0       0       0       0
1516268134      49.95   144.89  49.95   0       0       0       0       0       0       0
1516268135      49.95   55.700  49.95   0       0       0       0       0       0       0
1516268135      49.95   99.441  49.95   0       0       0       0       0       0       0

2nd, 3rd and 4th columns are floats, the rest are integers. Separator is tab.

I need to take N lines, and calculate min/mean/max values, like

1516268134      49.950  55.700  49.950  0       0       0       0       0       0       0
1516268134      49.950  99.939  49.950  0       0       0       0       0       0       0
1516268135      49.9500 144.890 49.950  0       0       0       0       0       0       0

Again, 2nd, 3rd and 4th columns are floats, the rest need to be integers. Separator is still a tab.

The code looks like this:

import sys
import pandas

file=open(sys.argv[2], "w")

for data in pandas.read_table(sys.argv[1], delim_whitespace=True, header=None, chunksize=int(sys.argv[3])):
  file.write("%d\t%f\t%f\t%f\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n" % (data[0].min(),  data[1].min(),  data[2].min(),  data[3].min(),  data[4].min(),  data[5].min(),  data[6].min(),  data[7].min(),  data[8].min(),  data[9].min(),  data[10].min()))
  file.write("%d\t%f\t%f\t%f\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n" % (data[0].mean(), data[1].mean(), data[2].mean(), data[3].mean(), data[4].mean(), data[5].mean(), data[6].mean(), data[7].mean(), data[8].mean(), data[9].mean(), data[10].mean()))
  file.write("%d\t%f\t%f\t%f\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n" % (data[0].max(),  data[1].max(),  data[2].max(),  data[3].max(),  data[4].max(),  data[5].max(),  data[6].max(),  data[7].max(),  data[8].max(),  data[9].max(),  data[10].max()))

file.close()

I'd like to make the code shorter and look better (& be more easy to understand & maintain).

Tried replacing the 11x data[X].FUNC() with with single data.FUNC() but that gave me error "TypeError: %d format: a number is required, not Series".

The next thing I tried was data.FUNC().convert_objects(convert_numeric=True) but that gave me the same error.

How can I replace

data[0].max(),  data[1].max(),  data[2].max(),  data[3].max(),  data[4].max(),  data[5].max(),  data[6].max(),  data[7].max(),  data[8].max(),  data[9].max(),  data[10].max()

with something short and simple, and keep the float/int format in the data?

I was looking for a solution to convert data.FUNC() to 11 individual numbers, but failed.

-Paavo

2 Answers2

1

First default separator in read_table is tab, so should be omit:

Then use aggregate or apply with aggregate functions in list:

df = pd.read_table('filename', header=None)

df = df.agg(['min','mean','max'])
#Alternatively:
#df = df.apply(['min','mean','max'])

print (df)
                0      1         2      3    4    5    6    7    8    9    10
min   1.516268e+09  49.95   55.7000  49.95  0.0  0.0  0.0  0.0  0.0  0.0  0.0
mean  1.516268e+09  49.95   99.9388  49.95  0.0  0.0  0.0  0.0  0.0  0.0  0.0
max   1.516268e+09  49.95  144.8900  49.95  0.0  0.0  0.0  0.0  0.0  0.0  0.0

Last if need append data to final file use DataFrame.to_csv with mode='a':

df.to_csv('filename1',index=False, header=None, mode='a', sep='\t')

All together:

file=open(sys.argv[2], "w")

for data in pd.read_table(sys.argv[1], header=None, chunksize=int(sys.argv[3])):
    data = data.agg(['min','max','mean'])
    data.to_csv('filename1',index=False, header=None, mode='a', sep='\t')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for fast reply, jezrael 1) I need to maintain the format, so 1516268134 cannot be 1.516268e+09, and 0.0 need to be 0 2) I have Python 2.6.6 (I know...), and agg/aggregate throws AttributeError: 'DataFrame' object has no attribute 'aggregate' -error 3) .apply throws TypeError: ("'list' object is not callable", u'occurred at index 0') error – Paavo Leinonen Jan 18 '18 at 11:48
  • Yes, so `agg` fialed for you, because your pandas version is bellow `0.20.0` I guess. And for second problem I think it is problem for easy solution for mean, max, min per rows for not floats. Because pandas working with all values as matrix and at least one float of matrix convert all values to `float`s. Your solution processes each row separately, so no problem with this. – jezrael Jan 18 '18 at 12:08
0

OK, learned a little more python, the key was to realise that the string % takes a tuple, and then I was able to make the code a lot shorter while keeping the format & functionality.

I ended up with the code below; looks good to me but I'm a true beginner with python, any improvement ideas?

import sys
import pandas

# parameters
# 1: input file, 11 fields separated with tab; 2nd, 3rd and 4th fields are floats
# 2: output file, reduced data in the same format; all other field are integers
# 3: data reduction slice size, N lines reduced to 3 lines with min/mean/max values

file=open(sys.argv[2],"w")
format="%d\t%f\t%f\t%f\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n"

for data in pandas.read_table(sys.argv[1], header=None, chunksize=int(sys.argv[3])):
  file.write(format % tuple(data.min()))
  file.write(format % tuple(data.mean()))
  file.write(format % tuple(data.max()))

file.close()

Not sure what the performance of this is, I kinda suspect that writing a line at a time is slow, but need to do a bit of testing tomorrow with bigger datasets.