As the title suggests I have data stored in multiple flat files in the following format:

215,,,215,16.4,0,2011/05/11 00:00:06
215,,,215,16.3,0,2011/05/11 00:00:23
217,,,217,16.3,0,2011/05/11 00:00:11
213,,,213,16.3,0,2011/05/11 00:00:17
215,,,215,16.3,0,2011/05/11 00:00:30

I am currently using the following awk command:

awk -F ',' '{gsub(/[\/:]/," ",$7); print mktime($7)":"$1":"$5}' MyFile

That gives me the output as follows (date converted to epoch, colon separator and moved around a little):

1305068406:215:16.4
1305068423:215:16.3
1305068411:217:16.3
1305068417:213:16.3
1305068430:215:16.3

The input file may not be in date order because of some hiccups when the file was being written, so next I pipe the output of the awk command above into sort -n, which sorts the data numerically with the oldest epoch time at the top.

1305068406:215:16.4
1305068411:217:16.3
1305068417:213:16.3
1305068423:215:16.3
1305068430:215:16.3

I am then piping the sorted output into another awk command:

awk -F ':' 'BEGIN {ORS=" ";c="rrdtool update ccdata2.rrd"; print c} NR % 100 == 0 {print "&& "c} $1>p {print $0;p=$0}'

This generates the output below, and ensures several rules:

  • Every 100 records, prints a && and a new rrdtool update ccdata2.rrd prefix (it doesn't seem that rrdtool likes an update with a lot of records)
  • Only prints out an rrd data line if the epoch time is greater than the last

The final output is as follows:

rrdtool update ccdata2.rrd 1305068406:215:16.4 1305068411:217:16.3 1305068417:213:16.3 1305068423:215:16.3 1305068430:215:16.3

If there are 300 records it would be (you get the idea)

rrdtool update ccdata2.rrd x:x:x <100 times> && rrdtool update ccdata2.rrd x:x:x <another 100 times>

Finally, I pipe that output to bash so that the shell executes the generated rrdtool update commands.

The full command is:

awk -F ',' '{gsub(/[\/:]/," ",$7); print mktime($7)":"$1":"$5}' MyFile | sort -n | awk -F ':' 'BEGIN {ORS=" ";c="rrdtool update ccdata2.rrd"; print c} NR % 100 == 0 {print "&& "c} $1>p {print $0;p=$0}' | bash

How could the above process be improved? How would you achieve the same thing? Please state why in your answer (e.g. could the two awk commands be converted into one?).

general exception

1 Answer

Since the data only contains [0-9:.] and newlines, xargs should be safe to use (for once), so you can lose the second awk and do:

awk -F ',' '{gsub(/[\/:]/," ",$7); print mktime($7)":"$1":"$5}' MyFile | 
sort -n | 
xargs rrdtool update ccdata2.rrd

xargs packs as many arguments as it can onto each rrdtool command line; if the arguments would push the command past the system limit (ARG_MAX), it runs additional rrdtool commands until all input has been processed.
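To see the batching without touching a real database, here is a minimal sketch. It assumes that echo is an acceptable stand-in for actually running rrdtool, and batch_update is a hypothetical helper name, not part of any tool. The -n option caps how many arguments read from standard input go into each invocation; -n 100 would correspond to the batch size in the question.

```shell
# batch_update is a hypothetical helper; echo stands in for running rrdtool.
batch_update() {
  # -n 2 limits each generated command to two arguments taken from stdin.
  xargs -n 2 echo rrdtool update ccdata2.rrd
}

# Three records in, so xargs has to build two commands.
printf '%s\n' 1305068406:215:16.4 1305068411:217:16.3 1305068417:213:16.3 |
  batch_update
```

With three input records and -n 2, this should print two rrdtool update lines: the first carrying two records, the second carrying the remaining one.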

EDIT:

To have the functionality of only printing out a line if the epoch date is greater than the last, I have updated the awk command to the following:

awk -F ',' '{gsub(/[\/:]/," ",$7)} $7>p {print mktime($7)":"$1":"$5;p=$7}' MyFile |
sort -n | 
xargs rrdtool update ccdata2.rrd
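One caveat with filtering on $7>p before the sort: each line is compared with the previous line kept from the unsorted input, so with the sample data the 00:00:11 and 00:00:17 records would be dropped because they arrive after 00:00:23. A possible alternative, sketched below, keeps the sort first and drops repeated timestamps afterwards. It assumes the input is already in epoch:value:value form; dedupe_and_update is a hypothetical helper name and echo stands in for running rrdtool.

```shell
# dedupe_and_update is a hypothetical helper; echo stands in for rrdtool.
dedupe_and_update() {
  # Sort numerically on the epoch field only, then keep a line only when its
  # epoch is strictly greater than the last one that was printed.
  sort -t: -k1,1n |
    awk -F: '$1 + 0 > p { print; p = $1 + 0 }' |
    xargs echo rrdtool update ccdata2.rrd
}

# Out-of-order input with one repeated timestamp (1305068411).
printf '%s\n' \
  1305068406:215:16.4 \
  1305068423:215:16.3 \
  1305068411:217:16.3 \
  1305068411:999:16.3 \
  1305068417:213:16.3 |
  dedupe_and_update
```

This does bring back a second awk, but a much smaller one than in the question, and xargs still takes care of splitting the update into several commands when needed.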
general exception
geirha
  • Can you explain why `xargs` would be safe to use here, and not anywhere else as implied? Thanks. – general exception Jul 14 '12 at 19:09
  • @generalexception, xargs splits the input into words by whitespace and quotes. In this case you know there won't be any quotes, and no whitespace within each word, so it's safe. A common mistake people make is combining it with find, like `find|xargs something`, which works "fine"... until there happens to be a filename with, for example, an apostrophe or a space. – geirha Jul 14 '12 at 19:14
  • @geirha And then the -0 option of xargs becomes handy… In his case he could use it combined with the OFS set to "\000" in awk, I suppose. – Stéphane Feb 27 '17 at 02:45
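For line-oriented data like this, one way to get NUL-separated input without changing the awk output separators is to convert newlines with tr. The sketch below assumes an xargs that supports -0 (GNU, BSD and BusyBox do). run_per_line is a hypothetical helper name, and printf stands in for the real command so that the argument boundaries are visible.

```shell
# run_per_line is a hypothetical helper; printf stands in for the real command.
run_per_line() {
  # Turn newline-terminated records into NUL-terminated ones so that xargs -0
  # passes each whole line as a single argument, whatever spaces or quotes
  # it contains.
  tr '\n' '\0' | xargs -0 printf '[%s]\n'
}

# One line has a space and an apostrophe, which plain xargs would trip over.
printf '%s\n' "it's a file" 'plain' | run_per_line
```

Each input line should come out wrapped in its own pair of brackets, showing that the apostrophe and the space did not split or break the argument.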