I have a really weird file format here, which uses any number of tabs and spaces to separate fields (including leading and trailing ones). Another peculiarity is that fields can contain spaces, in which case they are quoted and escaped in a CSV-like manner.

One example:

   0    "some string" 234      23947     123 ""some escaped"string""

I am trying to parse such lines with awk, and I need every item in an array, e.g.

foo[0] -> 0
foo[1] -> "some string"
foo[2] -> 234
foo[3] -> 23947
foo[4] -> 123
foo[5] -> ""some escaped"string""

Is this even possible? I read http://web.archive.org/web/20120531065332/http://backreference.org/2010/04/17/csv-parsing-with-awk/ which says that parsing CSV is already very hard. (To begin with, it would be enough to parse normal quoted strings containing spaces; the escaped variant is very rare.)

Before I spend a long time messing around: is there any way to do this in awk, or would I be better off using some other language?

reox
  • Your time would be better spent coaxing properly formatted output from your producer system ;-/ (Yes, CSV and Unix tools have different philosophies underlying them.) Good luck. – shellter Nov 10 '16 at 20:56
  • @shellter haha :D This will probably not happen... The files are generated by some software that only runs on Windows, with half-written documentation, and I am trying to convert them into a somewhat readable format... :/ The developer already said he will not support any software besides his own, so the only way is to convert the files on my own. I wonder how he can read the files in his product. – reox Nov 10 '16 at 21:16
  • After a quick glance I would say the solution should be stateful, or, if using a regex, lookahead would be needed, which awk doesn't support. I would say it's really hard to do with awk, and someone's gonna code it in 15 mins... – James Brown Nov 10 '16 at 23:03
  • There are many accepted CSV formats, but what you have posted does not conform to any of them. Parsing CSV with awk is easy, but your files are not CSV; they are just a mess with respect to the usage of quotes (e.g. `"..."` is a field, so `""` would be an empty field, but `""` is also the start and end of a field if that field contains `"`). Are you sure you didn't just copy/paste incorrectly? There is no other tool/language that would handle that text any better than awk. – Ed Morton Nov 11 '16 at 02:08

1 Answer

With GNU awk for FPAT:

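$ cat tst.awk
# A field is a run of non-blanks, a "..." string, or a ,..., string
# (after the substitutions below, "," stands in for each doubled quote).
BEGIN { FPAT="\\S+|\"[^\"]+\"|,[^,]+," }
{
    gsub(/@/,"@A")      # protect literal "@" so "@B" is unambiguous
    gsub(/,/,"@B")      # hide literal commas
    gsub(/""/,",")      # mask each "" as "," (the record is re-split here)
    for (i=1; i<=NF; i++) {
        gsub(/,/,"\"\"",$i)    # restore ""
        gsub(/@B/,",",$i)      # restore literal commas
        gsub(/@A/,"@",$i)      # restore literal "@"
        print i, $i
    }
}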

$ awk -f tst.awk file
1 0
2 "some string"
3 234
4 23947
5 123
6 ""some escaped"string""
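
The first three gsub() calls in the script can be run on their own to see what the masking step produces. This is only a minimal sketch: it assumes the sample line from the question, and the shell variable names `line` and `masked` are made up for illustration. It needs no GNU extensions, since gsub() is standard awk.

```shell
# Sample record from the question (the leading blanks are intentional).
line='   0    "some string" 234      23947     123 ""some escaped"string""'

masked=$(printf '%s\n' "$line" | awk '{
    gsub(/@/, "@A")   # protect literal "@" so "@B" stays unambiguous
    gsub(/,/, "@B")   # hide literal commas, freeing "," as a placeholder
    gsub(/""/, ",")   # each doubled quote becomes one ","
    print
}')

printf '%s\n' "$masked"
```

After this step the escaped field reads `,some escaped"string,`, which is the shape the third FPAT alternative `,[^,]+,` matches. The gsub() calls inside the loop then undo the three substitutions in reverse order, one field at a time.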

To understand what that's doing, see https://stackoverflow.com/a/40512703/1745001
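
FPAT is a GNU awk extension. Where only a POSIX awk is available, a similar result can probably be had by peeling fields off the record one at a time with match(). The following is a rough sketch and not part of the answer above: the function name `splitq`, the array `foo` and the shell variable `out` are hypothetical, the input is assumed to be the sample line from the question, and odd cases such as an empty `""` field are not handled specially. It stores the fields zero-based, as the question asks.

```shell
line='   0    "some string" 234      23947     123 ""some escaped"string""'

out=$(printf '%s\n' "$line" | awk '
# splitq (hypothetical helper): split rec into arr[0..n-1] and return n.
function splitq(rec, arr,    n, re, f) {
    gsub(/@/, "@A", rec)      # same masking as in the FPAT script
    gsub(/,/, "@B", rec)
    gsub(/""/, ",", rec)
    n = 0
    while (1) {
        sub(/^[ \t]+/, "", rec)               # skip separators
        if (rec == "") break
        if (rec ~ /^,[^,]+,/)       re = "^,[^,]+,"       # masked ""...""
        else if (rec ~ /^"[^"]+"/)  re = "^\"[^\"]+\""    # plain "..."
        else                        re = "^[^ \t]+"       # bare word
        match(rec, re)                        # sets RLENGTH for this field
        f = substr(rec, 1, RLENGTH)
        rec = substr(rec, RLENGTH + 1)
        gsub(/,/, "\"\"", f)  # undo the masking on this field only
        gsub(/@B/, ",", f)
        gsub(/@A/, "@", f)
        arr[n++] = f
    }
    return n
}
{
    n = splitq($0, foo)
    for (i = 0; i < n; i++) print "foo[" i "] -> " foo[i]
}')

printf '%s\n' "$out"
```

Choosing the pattern from the first character of the remaining text avoids relying on how a particular awk resolves overlapping alternatives in a single regex, at the cost of a few extra lines.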

Ed Morton