2

I have a historical autogenerated logfile with the following format that I would like to convert to a csv file prior to uploading to a database

--------------------------------------
Thu Jul  8 09:34:12 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 312 cps/MBq
--------------------------------------
Thu Jul  8 09:34:55 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 318 cps/MBq
--------------------------------------
Thu Jul  8 10:13:39 BST 2010
RED Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 307 cps/MBq
--------------------------------------
Thu Jul  8 10:14:10 BST 2010
RED Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 305 cps/MBq
--------------------------------------
Mon Jul 19 10:11:18 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 326 cps/MBq
--------------------------------------
Mon Jul 19 10:12:09 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 333 cps/MBq
--------------------------------------
Mon Jul 19 10:13:57 BST 2010
RED Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 338 cps/MBq
--------------------------------------
Mon Jul 19 10:14:45 BST 2010
RED Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 340 cps/MBq
--------------------------------------

I would like to convert the logfile to the following format

Date,Camera,Head,Duration,Activity
08/07/10,BLUE,1,20,14.9
08/07/10,BLUE,1,20,14.9
08/07/10,RED,1,20,14.9
08/07/10,RED,1,20,14.9

I have used awk to get me close to what I wish

awk 'BEGIN {print "Date,Camera,Head,Duration,Activity";RS = "--------------------------------------"; FS="\n";}; {OFS=",";split($3, a, " ");split($4,b, " "); split($5,c," ");print $2,a[1],a[3],b[3],c[3]}' sensitivity.txt > sensitivity.csv

which gives me

Date,Camera,Head,Duration,Activity
,,,,
Thu Jul  8 09:34:12 BST 2010,BLUE,1,20,14.9
Thu Jul  8 09:34:55 BST 2010,BLUE,1,20,14.9
Thu Jul  8 10:13:39 BST 2010,RED,1,20,14.9
Thu Jul  8 10:14:10 BST 2010,RED,1,20,14.9

How can I

(a) get rid of the 4 output field separators in line 4 (b) Convert the date format from Thu Jul 8 09:34:12 BST 2010 to DD/MM/YY (Can I do this in pure awk or by piping to perl)

moadeep
  • 3,988
  • 10
  • 45
  • 72
  • For the date conversion, look at the first answer here http://stackoverflow.com/questions/2121896/converting-dates-in-awk – marmottus Apr 10 '13 at 10:30
  • And for the useless commas, just check the values of $2, a, b, c, etc... (if ($2) { print ... }) – marmottus Apr 10 '13 at 10:34
  • 4
    I don't think I've seen anyone ask to convert 4-digit years to 2-digit years in their dates since 2000 rolled along. Seriously consider using a YYYYMMDD date format so you can distinguish 1999 from 2099 and trivially sort your data by date. – Ed Morton Apr 10 '13 at 12:46

3 Answers3

2

@sudo_O's answer is fine but here's an alternative:

$ cat tst.awk
BEGIN{ RS="---+\n"; OFS=","; months="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR==1{ print "Date","Camera","Head","Duration","Activity"; next }
{ print sprintf("%04d%02d%02d",$6,(match(months,$2)+2)/3,$3),$7,$9,$12,$16 }

$ gawk -f tst.awk file
Date,Camera,Head,Duration,Activity
20100708,BLUE,1,20,14.9
20100708,BLUE,1,20,14.9
20100708,RED,1,20,14.9
20100708,RED,1,20,14.9
20100719,BLUE,1,20,12.4
20100719,BLUE,1,20,12.4
20100719,RED,1,20,12.4
20100719,RED,1,20,12.4

Note that I used GNU awk above so I could set the RS to more than a single character. With other awks just convert all the "---..."s lines to a blank line or control character or something and set RS accordingly before running the script.

If you don't like my suggested date format, tweak the sprintf() to suit.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
1

This straight forward awk script will do the job:

BEGIN {
    n=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",month,"|")
    for (i=1;i<=n;i++) {
        month_index[month[i]] = i
    }
    print "Date,Camera,Head,Duration,Activity"  
}
/^-*$/{
    i=0
    next
}
{
    i++
}
i==1{
    printf "%02d/%02d/%02d,",$3,month_index[$2],substr($6,3)
}
i==2{
    printf "%s,%d,",$1,$3
}
i==3{
    printf "%d,",$3
}
i==4{
    printf "%.1f\n",$3
}

Outputs:

$ awk -f script.awk file
08/07/10,BLUE,1,20,14.9
08/07/10,BLUE,1,20,14.9
08/07/10,RED,1,20,14.9
08/07/10,RED,1,20,14.9
19/07/10,BLUE,1,20,12.4
19/07/10,BLUE,1,20,12.4
19/07/10,RED,1,20,12.4
19/07/10,RED,1,20,12.4
Chris Seymour
  • 83,387
  • 30
  • 160
  • 202
  • Your month array will contain 24 entries instead of 12. It'll work as long as the OP doesn't want to print all of the months. Consider using 2 arrays for monthNr2Nm and monthNm2Nr. – Ed Morton Apr 10 '13 at 12:49
  • @EdMorton I know, I was just being lazy, I don't see the OP wanting to iterate over the array but changed it just in case. – Chris Seymour Apr 10 '13 at 12:53
  • @sudo_O - agreed but other people looking for answers to "similar" questions might see it. Mostly I think it helps with clarity if you have 2 separate arrays, even if the one used in the split is just named "tmp" or something. – Ed Morton Apr 10 '13 at 13:13
1

I figured I would show how to actually parse the input, rather than just performing string transformations.

#! /usr/bin/env perl
use strict;
use warnings;
use Date::Parse;
use Date::Format;
use Text::CSV;

sub convert_date{
  my $time = str2time($_[0]);
  # iso 8601 style:
  return time2str('%Y-%m-%d',$time); # YYYY-MM-DD

  # or the outdated style output you wanted
  return time2str('%d/%m/%y',$time); # DD/MM/YY
}

my %multiply_table = (
  s => 1,
  m => 60,
  h => 60 * 60,
  d => 60 * 60 * 24,
);
sub convert_duration{
  my($d,$s) = $_[0] =~ /^ \s* (\d+) \s* (\w) \s* $/x;
  die "Invalid duration '$_[0]'" unless $d && $s;
  return $d * $multiply_table{$s};
}

my @field_list = qw'Date Camera Head Duration Activity';

my $csv = Text::CSV->new( { eol => "\n" } );

# print header
$csv->print( \*STDOUT, \@field_list );

# set record separator
local $/ = ('-' x 38) . "\n";

# parse data
while(<>){
  chomp; # remove record separator
  next unless $_; # skip empty section
  my($time,$camdat,@fields) = split m/\n/; # split up the fields

  my %data;


  # split camera and head fields
  @data{qw(Camera Head)} = split /\s+Head\s+/, $camdat;

  # parse lines like:
  #   Duration = 20 s
  #   Activity = 14.9 MBq
  #   Sensitivity = 305 cps/MBq
  for(@fields){
    my($key,$value) = /(\w+) \s* = \s* (.*) /x;
    $data{$key} = $value;
  }

  # at this point we start reducing precision

  $data{Date} = convert_date( $time );

  # remove measurement units
  $data{Duration} = convert_duration($data{Duration}); # safe
  $data{Activity} =~ s/[^\d]*$//; # unsafe

  $csv->print(\*STDOUT, [@data{@field_list}]);
}
Brad Gilbert
  • 33,846
  • 11
  • 78
  • 129