Assume an incorrectly formatted CSV snippet that uses semicolons as the field separator:

abc;d" "e"f;"ijk"

According to RFC 4180, a double quote inside a field should be represented by two double quotes:

abc;d"" ""e""f;"ijk"

I've tried to achieve this with a sed script that matches any double quote not preceded or followed by the field separator (here ;):

echo 'abc;d" "e"f;"ijk"' | sed -e 's/\([^;]\)"\([^;]\)/\1""\2/g'

The result is almost right:

abc;d"" "e""f;"ijk"

except the double quote before the e is not matched and therefore not duplicated.

Can anyone explain why this doesn't work? There is no semicolon directly before or after the quote that precedes the e.

1 Answer

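Your second quote is not matched because its match would have to be space, quote, e, and that space has already been consumed by the preceding match (d, quote, space). With the g flag, sed resumes scanning after the end of the previous match, so matches never overlap.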
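One way to see which substrings the pattern actually consumes is to print each non-overlapping match on its own line. This is only a sketch: it assumes a grep that supports the -o option (GNU and BSD grep do), and show_matches is a made-up helper name.

```shell
# Hypothetical helper: list every non-overlapping match of the
# question's pattern, one per line (relies on grep -o).
show_matches() {
    printf '%s\n' "$1" | grep -o '[^;]"[^;]'
}

show_matches 'abc;d" "e"f;"ijk"'
# prints two matches:
#   d" (followed by the consumed space)
#   e"f
```

The space shows up at the end of the first match, so it is no longer available as the leading character for the quote before the e.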

This is a textbook use case for lookaround assertions, which check the surrounding context without consuming it. Unfortunately, sed does not support lookarounds. If I had to use sed for this, I would first replace the valid quotes with some character that does not occur in the data, then double all remaining quotes, then put the valid quotes back.
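The three steps could be sketched roughly as follows. Several things here are assumptions rather than part of the approach itself: the placeholder is the control byte 0x01, assumed never to appear in the input; a "valid" quote is taken to be one at the start or end of the line or directly next to a semicolon; and fix_quotes is a made-up function name.

```shell
# Hypothetical helper: reads lines on stdin, writes fixed lines to stdout.
fix_quotes() {
    # Placeholder byte (0x01); assumed not to occur in the data.
    ph=$(printf '\001')
    # Step 1: hide the valid quotes (line start, line end, next to ;).
    # Step 2: double every quote that is still visible.
    # Step 3: turn the placeholders back into quotes.
    sed -e "s/^\"/$ph/" \
        -e "s/\"\$/$ph/" \
        -e "s/;\"/;$ph/g" \
        -e "s/\";/$ph;/g" \
        -e 's/"/""/g' \
        -e "s/$ph/\"/g"
}

printf '%s\n' 'abc;d" "e"f;"ijk"' | fix_quotes
# prints: abc;d"" ""e""f;"ijk"
```

Unlike the original pattern, this also leaves a quote at the very start of a line alone, since that would normally be the opening quote of the first field.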

Perl has lookaround matching, which is (a little) easier on the eyes:

$ echo 'abc;d" "e"f;"ijk"' | perl -pe 's/(?<!;)"(?![;\n])/""/g'
abc;d"" ""e""f;"ijk"

Translation: match a quote that is not preceded by ; and not followed by either ; or a newline.

The \n is there because, with -p, Perl keeps the trailing newline as part of the line, so the last quote would also match unless we exclude it.

Law29