4

This is an extension of a related question that was answered here.

I have a weekly CSV file that needs to be parsed. It looks like this:

"asdf","asdf","asdf","asdf"

But sometimes there are text fields that contain extra unescaped double quotes, like this:

"asdf","as "something" df","asdf","asdf"

From the other posts on here, I was able to put together a regex

(?m)""(?![ \t]*(,|$))

which matches two successive double quotes, only "if they DON'T have a comma or end-of-the-line ahead of them with optionally spaces and tabs in between"

Now, this only finds double quotes in succession. How do I modify it to find and replace/delete the double quotes around "something" in the file?

Thanks.

stevenjmyu
  • Whatever language you're using almost certainly already has a CSV parser - have you checked whether that is already capable of handling malformed input? (and thus saves you re-inventing the wheel) – Peter Boughton Jul 05 '10 at 16:14
  • I'm not using any programming language. It's part of a process that automates a bunch of files for daily batch processing, and the client uses an automation program called winautomation, which has a built-in Replace Text action that accepts .NET-flavor regex in the find/replace parameters. – stevenjmyu Jul 05 '10 at 17:47
  • Ok, well all that should have been part of the question - especially the ".NET flavor regex" part. – Peter Boughton Jul 05 '10 at 18:36

3 Answers

7
(?<!^|,)"(?!,|$)

will match a double quote that is neither preceded nor followed by a comma and is not at the start or end of a line.
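As a rough illustration of how this pattern behaves, here is a minimal Python sketch. Python's `re` module only accepts fixed-width lookbehind, so the `(?<!^|,)` part is split into two separate lookbehinds, which should be equivalent. The function name `strip_inner_quotes` is made up for this example, and the multiline flag is an assumption so that `^` and `$` apply to every line of the file.

```python
import re

# Same idea as (?<!^|,)"(?!,|$), with the lookbehind split in two because
# Python's re module requires fixed-width lookbehind.
INNER_QUOTE = re.compile(r'(?<!^)(?<!,)"(?!,|$)', re.MULTILINE)


def strip_inner_quotes(text: str) -> str:
    """Delete quotes that are not next to a comma or a line boundary."""
    return INNER_QUOTE.sub("", text)


sample = '"asdf","as "something" df","asdf","asdf"'
cleaned = strip_inner_quotes(sample)
print(cleaned)
```

On the sample line from the question, this leaves the field delimiters alone and prints `"asdf","as something df","asdf","asdf"`.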

If you need to allow whitespace around the commas or at start/end-of-line, and if your regex flavor (which you didn't specify) allows arbitrary-length lookbehind (.NET does, for example), you can use

(?<!^\s*|,\s*)"(?!\s*,|\s*$)
Tim Pietzcker
  • Wow, thanks a bunch, Tim. It's .NET flavor. I'm currently using a text find/replace function within an automation program called winautomation. However, using both of your regexes in a find and replace also replaces the first and the last double quote of every line. It does find and replace the unescaped double quotes within each text field. So using "asdf","as "something" df","asdf" as an example, with ^ as the replacement, I get ^asdf","as ^something^ df","asdf^. How do I remedy the first and last "? – stevenjmyu Jul 05 '10 at 17:35
  • You need to set the option to allow `^` and `$` to match start and end of lines (instead of the entire input). In .NET, that option is called `RegexOptions.Multiline`. I don't know if you can pass that option to winautomation. – Tim Pietzcker Jul 05 '10 at 18:11
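
To see what the multiline option changes, here is a small sketch in Python, where `re.MULTILINE` stands in for `RegexOptions.Multiline`. The pattern is a fixed-width rewrite of the one in the answer (Python's `re` does not accept the original lookbehind), and the two-line input is invented for the example. If the option cannot be passed separately, the inline form `(?m)` at the start of the pattern, as used in the question's regex, sets the same option.

```python
import re

# Python stand-in for the .NET pattern; re.MULTILINE plays the role of
# RegexOptions.Multiline here.
PATTERN = r'(?<!^)(?<!,)"(?!,|$)'
two_lines = '"a","b "x" c"\n"d","e"'

# Without the option, ^ and $ are tied to the ends of the whole input,
# so the quotes on either side of the inner line break are replaced too.
without_flag = re.sub(PATTERN, "^", two_lines)

# With the option, the start and end of every line are protected.
with_flag = re.sub(PATTERN, "^", two_lines, flags=re.MULTILINE)

# The inline form (?m) at the start of the pattern sets the same option.
inline_flag = re.sub("(?m)" + PATTERN, "^", two_lines)

print(without_flag)
print(with_flag)
```

Without the flag, the closing quote of the first line and the opening quote of the second line are replaced as well; with it, only the quotes around `x` are.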
3

I was using Vim to remove nested quotes in a .CSV file, and this worked for me:

"[^,"][^"]*"[^,]
Rick Tilley
1

In Vim, I used this to remove all the unescaped quotes:

:%s/\v("(,")@!)&((",)@<!")&("(\n)@!)&(^@<!")//gc

A detailed explanation:

: - start the Vim command
    % - scope of the command is the whole file
    s - search and replace (substitute)
        / - start of search pattern
        \v - "very magic" mode, so groups and lookarounds need no backslashes
            (
                " - double quote
                (,") - comma_quote
                @! - the preceding group must not follow
            )
            & - and (every branch must match at the same position)
            (
                (",) - quote_comma
                @<! - the preceding group must not come before
                " - double quote
            )
            & - and
            (
                " - double quote
                (\n) - line end
                @! - the preceding group must not follow
            )
            & - and
            (
                ^ - line beginning
                @<! - the preceding item must not come before
                " - double quote
            )
        / - end of search pattern and start of replace pattern
             - replace with nothing (delete)
        / - end of replace pattern
    g - apply to all the matches on each line
    c - ask for confirmation before every replacement

This does the job fairly quickly. The only case where it fails is when the data itself contains the "," sequence.
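
To make that failure case concrete, here is a small Python sketch. The regex is only an approximation of the four Vim conditions above, not the Vim command itself, and the input line and the `field_count` helper are invented for the example. The middle field is meant to contain the text `say "hi","bye" now`.

```python
import csv
import re

# Rough Python counterpart of the four Vim conditions (an approximation):
# not at line start, not preceded by ", and not followed by ," or line end.
UNESCAPED = re.compile(r'(?<!^)(?<!",)"(?!,")(?!$)', re.MULTILINE)


def field_count(line: str) -> int:
    """Count the fields a standard CSV parser sees in one line."""
    return len(next(csv.reader([line])))


# Intended as 3 fields; the middle field itself contains a "," sequence.
tricky = '"x","say "hi","bye" now","y"'
cleaned = UNESCAPED.sub("", tricky)

print(cleaned)
print(field_count(cleaned))
```

The embedded `","` looks exactly like a real field boundary, so its quotes are kept and the cleaned line parses as four fields instead of three. No context-based pattern can tell the two cases apart.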