My input file has its content in the following format, where columns are separated by a single space:

string1<space>string2<space>string3<space>YYYY-mm-dd<space>hh:mm:ss.SSS<space>string4<space>10:1234567890<space>0e:Apple 1.2.3.4<space><space>string5<space>HEX  

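There are two spaces after "0e:Apple 1.2.3.4" because the string is only 13 characters long while its prefix declares 14, so the 14th character is a space. The first space therefore belongs to the value of that column ("0e:Apple 1.2.3.4" plus one space is a single value), and only the second space is the column separator.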

In the 7th column, 10: represents the count of characters in the following string.

In the 8th column, 0e: is the hex value for 14, so the prefix again gives the number of characters in the string that follows.

Like:

"0e:Apple 1.2.3.4 " is the actual value in the 8th column, without the quotes  
    (the quotes are only there to show that the 14th character is a space)  

It's counted as follows, with ␣ marking a space:  
0e:A p p l e ␣ 1 . 2 .  3  .  4  ␣  
   | | | | | | | | | |  |  |  |  |  
   1 2 3 4 5 6 7 8 9 10 11 12 13 14  

Let's consider first row from the input file as:

string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 10:1234567890 0e:Apple 1.2.3.4  string5 001e  

where:

  • string1 is the value in 1st column
  • string2 is the value in 2nd column
  • string3 is the value in 3rd column
  • yyyy-mm-dd in 4th
  • 23:50:45.999 in 5th
  • string4 in 6th
  • 10:1234567890 in 7th //no space at the end because the string has all 10 characters
  • 0e:Apple 1.2.3.4 in 8th //one space at the end, as the 14th character
  • string5 in 9th
  • 001e in 10th

Expected output:

string1,string2,string3,yyyy-mm-dd,23:50:45.999,string4,1234567890,Apple_1.2.3.4,string5,30  

Requirements:

  1. Remove the counts from the 7th and 8th columns (10: and 0e:).
  2. The space between Apple and 1.2.3.4 should be replaced by "_".
  3. The hex value in the last column should be converted to decimal.
  4. Replace the space between columns with ",".
  5. I've used a hex value only in the 10th column here. What if hex values appear in several columns? Is there a way to convert only specific columns?

I've tried using this:

$ cat input.txt |sed 's/[a-z0-9].*://g'  

which gives output as:

string1,string2,string3,yyyy-mm-dd,45.999,string4,1234567890,Apple,1.2.3.4,,string5,001e  
mklement0
intruder
  • Are you sure you mean *preceding*? – Michael Vehrs May 26 '16 at 05:34
  • So, basically, you have not tried to do anything yourself. The `sed` example you posted is obviously unsuitable for your requirements (except the first, possibly). And `sed` is not powerful enough for what you are trying to do. A `sed` guru would probably be able to write a two hundred line program to solve the problem, but it would be insanely difficult. – Michael Vehrs May 26 '16 at 05:43
  • @MichaelVehrs my bad. It's following string. Edited it. Thanks! :) – intruder May 26 '16 at 19:30
  • @MichaelVehrs yeah. My script just does the first. I'm able to do 4 with other script too. But not sure how to proceed with 2,3 and 5. Can we do it with awk? (Parsing string by string?) – intruder May 26 '16 at 19:45

1 Answer

This will do what you want on your example input (the empty field $10 created by the double space is skipped in the output):

awk -F "[ ]" '{sub(/.*:/, "", $7); sub(/.*:/, "", $8); printf "%s,%s,%s,%s,%s,%s,%s,%s_%s,%s,%d\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $11, "0x"$12}' input.txt

Explanation of parts:

awk's printf lets you specify the output format, so you can choose by hand which fields are joined with , and which with _.

-F "[ ]" forces the field separator to be a single space so that it knows there is an empty field between two single spaces. The default behavior would be to allow multiple spaces to be a single delimiter, which is not what you want according to the question.
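
If you want to check the splitting rule outside awk, Python's str.split shows the same distinction: with no argument it collapses runs of whitespace, while an explicit single-space separator keeps the empty field. This is only an illustration using the sample row from the question; the variable names are made up for the example.

```python
# Sample row from the question; note the two spaces after "1.2.3.4".
row = ("string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 "
       "10:1234567890 0e:Apple 1.2.3.4  string5 001e")

# Comparable to awk's default separator: runs of whitespace are one delimiter.
collapsed = row.split()

# Comparable to awk -F "[ ]": every single space is a delimiter,
# so the double space produces an empty field.
single = row.split(" ")

print(len(collapsed), len(single))  # number of fields in each case
print(repr(single[9]))              # the empty field (awk's $10)
```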

The sub function allows you to do regular expression replacement, in this case removing the ..: prefix in fields 7 and 8.

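For field 12, we tell printf to output a number (%d) and pass it the field's string prefixed with 0x so that awk interprets it as hexadecimal. This depends on the awk implementation: gawk does not treat such strings as hexadecimal by default, so there you would use strtonum("0x" $12) instead.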
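
On requirement 5 (hex values in several columns), one option is to keep a list of the column positions that should be converted. The sketch below shows the idea in Python rather than awk. HEX_COLUMNS and convert_hex_columns are hypothetical names chosen for this example, and the column positions are an assumption about your data.

```python
# Hypothetical: 0-based positions of the columns that hold hex values.
HEX_COLUMNS = {9}

def convert_hex_columns(fields, hex_columns=HEX_COLUMNS):
    """Return a copy of fields with the chosen columns converted from hex to decimal."""
    return [
        str(int(value, 16)) if i in hex_columns else value
        for i, value in enumerate(fields)
    ]

# Fields of the sample row after the other requirements have been applied.
fields = ["string1", "string2", "string3", "yyyy-mm-dd", "23:50:45.999",
          "string4", "1234567890", "Apple_1.2.3.4", "string5", "001e"]
print(",".join(convert_hex_columns(fields)))
```

To convert more columns, add their positions to the set, e.g. {2, 9}.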

Note: If it's not always the case that you want the output to be $8_$9, then you actually need to parse the hexadecimal prefix and count off characters in order to determine where the field ends. If that's the case, I would personally prefer to write the whole thing in something else, e.g. Python.
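
A minimal sketch of that approach in Python, which reads each count and then takes exactly that many characters. It rests on assumptions the question does not settle: that the 7th column's count is decimal, that the 8th column's count is hex, and that only the 10th column holds a hex value. PREFIXED, HEX_COLUMNS and parse_line are hypothetical names for this example.

```python
# Assumptions for this sketch (adjust to your data):
PREFIXED = {6: 10, 7: 16}   # 0-based column -> base of its "count:" prefix
HEX_COLUMNS = {9}           # 0-based columns to convert from hex to decimal

def parse_line(line):
    """Convert one space-separated input row to the comma-separated output."""
    fields, pos = [], 0
    while pos < len(line):
        col = len(fields)
        if col in PREFIXED:
            # Read the count before ":", then take exactly that many characters.
            colon = line.index(":", pos)
            count = int(line[pos:colon], PREFIXED[col])
            value = line[colon + 1:colon + 1 + count]
            pos = colon + 1 + count
            # Drop padding spaces, then join the remaining words with "_".
            value = value.strip().replace(" ", "_")
        else:
            # Ordinary column: read up to the next space (or end of line).
            end = line.find(" ", pos)
            if end == -1:
                end = len(line)
            value = line[pos:end]
            pos = end
        if col in HEX_COLUMNS:
            value = str(int(value, 16))
        fields.append(value)
        pos += 1  # skip the single space that separates columns
    return ",".join(fields)

row = ("string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 "
       "10:1234567890 0e:Apple 1.2.3.4  string5 001e")
print(parse_line(row))
```

When reading from a file, strip the trailing newline (and any trailing spaces) from each line before passing it to parse_line, otherwise an extra empty field may be appended.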

leekaiinthesky