
My CSV file has text of the following form:

"abcd, xyz", abcd, 012
"xyz, 123, abcd", 123, "abcd, pqr"

Each column may contain comma-separated text. In that case, the column's text is enclosed in a pair of '"' characters.

I'm seeking help to parse such a file using sed or awk.

Thank you very much in advance.

sameer oak
    Such questions are asked and answered daily. __Please__ make an effort to search existing questions/answers. – devnull Feb 10 '14 at 11:23
  • Please read my question in a greater detail. The CSV file may contain a text which is enclosed in a pair of '"' characters. – sameer oak Feb 10 '14 at 11:24
    you may use csv module of python. – Gaurav Kumar Feb 10 '14 at 11:27
  • Yes, I said so after reading your question _in a greater detail_. (FWIW, I've answered this in the past using `awk`, using `perl`.) – devnull Feb 10 '14 at 11:27
  • See [this](http://stackoverflow.com/questions/20284578/csv-parse-with-ifs-bash-choose-as-the-separator/) question. The only difference is that your example uses `,` as the delimiter whereas the referenced question makes use of `;`. – devnull Feb 10 '14 at 11:40
  • Moreover, you'd see both `awk` and `perl` answers therein. – devnull Feb 10 '14 at 11:42
  • So what parse do you need do? what output do you expect? – BMW Feb 10 '14 at 11:42
  • You may also prefer to consider a robust/general approach like: http://perlmaven.com/how-to-read-a-csv-file-using-perl – Håkon Hægland Feb 10 '14 at 12:40
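As a couple of the comments suggest, Python's csv module handles quoted fields natively. A minimal sketch of that route, run from the shell (assuming python3 is available and the data is in a file named `file`; note the reader strips the surrounding quotes, unlike the awk answers below):

```shell
python3 -c '
import csv, sys

# skipinitialspace=True drops the blanks that follow each comma;
# quoted fields may contain commas and come back as single values
for row in csv.reader(sys.stdin, skipinitialspace=True):
    for i, field in enumerate(row, 1):
        print("\tField %d = <%s>" % (i, field))
' < file
```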

2 Answers


The simplest thing to do is convert either the commas within the quotes or the commas between the fields to some other character. For example, this converts every "," between fields to a tab character:

$ awk 'BEGIN{FS=OFS="\""} {for (i=1;i<=NF;i+=2) gsub(/[[:space:]]*,[[:space:]]*/,"\t",$i)} 1' file
"abcd, xyz"     abcd    012
"xyz, 123, abcd"        123     "abcd, pqr"

Of course you then need to find some character that can't occur in your input, so you could choose a control character, SUBSEP, or similar.
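For instance, a sketch along the lines of the command above using the ASCII "unit separator" control character (octal \037) as the intermediate delimiter, again assuming the data is in `file`:

```shell
US=$(printf '\037')      # ASCII unit separator: very unlikely to appear in real data

# first awk: replace the between-field commas (the even "-delimited chunks are
# inside quotes, so only odd chunks are touched); second awk: split on \037
awk 'BEGIN{FS=OFS="\""} {for (i=1;i<=NF;i+=2) gsub(/[[:space:]]*,[[:space:]]*/,"\037",$i)} 1' file |
awk -F"$US" '{for (i=1;i<=NF;i++) print "\tField " i " = <" $i ">"}'
```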

Alternatively, this converts every "a" to "aA" and every separator to "aB", so you KNOW the separator string can't occur in your input:

$ awk 'BEGIN{FS=OFS="\""} {gsub(/a/,"aA"); for (i=1;i<=NF;i+=2) gsub(/[[:space:]]*,[[:space:]]*/,"aB",$i)} 1' file
"aAbcd, xyz"aBaAbcdaB012
"xyz, 123, aAbcd"aB123aB"aAbcd, pqr"

and you can do:

$ awk 'BEGIN{FS=OFS="\""} {gsub(/a/,"aA"); for (i=1;i<=NF;i+=2) gsub(/[[:space:]]*,[[:space:]]*/,"aB",$i)} 1' file |
awk -F'aB' '{gsub(/aA/,"a"); print $0; for (i=1;i<=NF;i++) print "\tField " i " = <" $i ">"}'
"abcd, xyz"aBabcdaB012
        Field 1 = <"abcd, xyz">
        Field 2 = <abcd>
        Field 3 = <012>
"xyz, 123, abcd"aB123aB"abcd, pqr"
        Field 1 = <"xyz, 123, abcd">
        Field 2 = <123>
        Field 3 = <"abcd, pqr">

If you'd like to do it all in one command:

$ awk '
function decomma() {
    FS = OFS = "\""
    $0 = $0
    gsub(/a/,"aA")
    for (i=1;i<=NF;i+=2)
        gsub(/[[:space:]]*,[[:space:]]*/,"aB",$i)
    gsub(/aA/,"a")
    FS = "aB"
    $0 = $0
}

{
    print $0
    decomma()
    for (i=1;i<=NF;i++)
       print "\tField " i " = <" $i ">"
}
' file
"abcd, xyz", abcd, 012
        Field 1 = <"abcd, xyz">
        Field 2 = <abcd>
        Field 3 = <012>
"xyz, 123, abcd", 123, "abcd, pqr"
        Field 1 = <"xyz, 123, abcd">
        Field 2 = <123>
        Field 3 = <"abcd, pqr">
Ed Morton
    Exactly, I did the same. Just a slight change, though, I converted the ',' enclosed in a pair of '"' by '|' and did the further parsing. I had implemented this solution but I thought I could still optimize the same through a single sed/awk. Thank you very much, anyway. I'll give your solution a "check" mark. – sameer oak Feb 11 '14 at 03:18
  • Yeah, that's fine if you can come up with a char that is guaranteed not to be in your input file. An alternative might be to use the FPAT variable if you have GNU awk. See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for details. For full blown CSV parsing with awk including handling escaped quotes and newlines within fields, google lorance stinson's csv parser. – Ed Morton Feb 11 '14 at 03:21
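To illustrate the FPAT approach mentioned in the comment above — a sketch that requires GNU awk and that allows for the blanks around the separators (the alternation order doesn't matter, since gawk takes the leftmost-longest match, so a quoted field wins over a bare run of non-commas):

```shell
gawk 'BEGIN { FPAT = "([ \t]*\"[^\"]+\")|([^,]+)" }
{
    for (i = 1; i <= NF; i++) {
        gsub(/^[ \t]+|[ \t]+$/, "", $i)    # trim blanks around each field
        print "\tField " i " = <" $i ">"
    }
}' file
```

This prints the same Field listing as the commands above, with gawk itself doing the splitting instead of a two-pass substitution.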

Another CSV format is as follows (the only quoted fields are those that contain commas):

  field1, "field2,with,commas"  ,  field3  ,  "field4,foo"   

We have a mixture of quoted and unquoted fields here, which cannot be parsed directly by any value of FS (that I know of, at least). However, we can still get the fields using match() in a loop (and cheating a bit):

c=0
$0=$0","                                   # yes, cheating
while($0) {
  match($0,/ *"[^"]*" *,|[^,]*,/)
  f=substr($0,RSTART,RLENGTH)             # save what matched in f
  gsub(/^ *"?|"? *,$/,"",f)               # remove extra stuff
  print "Field " ++c " is " f
  $0=substr($0,RLENGTH+1)                 # "consume" what matched
}

The above is quoted from: http://web.archive.org/web/20120531065332/http://backreference.org/2010/04/17/csv-parsing-with-awk/
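Wrapped into a complete command (a sketch; `file2` is a hypothetical input file holding the sample line shown above):

```shell
awk '{
    c = 0
    $0 = $0 ","                             # yes, cheating
    while ($0) {
        match($0, / *"[^"]*" *,|[^,]*,/)    # next quoted-or-bare field plus comma
        f = substr($0, RSTART, RLENGTH)     # save what matched in f
        gsub(/^ *"?|"? *,$/, "", f)         # remove extra stuff
        print "Field " ++c " is " f
        $0 = substr($0, RLENGTH + 1)        # "consume" what matched
    }
}' file2
```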

Kent