4

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?

Not a match:

"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"

Match:

"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"
Gumbo
  • 643,351
  • 109
  • 780
  • 844
Even Mien
  • 44,393
  • 43
  • 115
  • 119

5 Answers5

5

Try this:

(?m)""(?![ \t]*(,|$))

Explanation:

(?m)       // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
""         // match two successive double quotes
(?!        // start negative look ahead
  [ \t]*   //   zero or more spaces or tabs
  (        //   open group 1
    ,      //     match a comma 
    |      //     OR
    $      //     the end of the line or string
  )        //   close group 1
)          // stop negative look ahead

So, in plain English: "match two successive double quotes, only if they DON'T have a comma or end-of-the-line ahead of them with optionally spaces and tabs in between".

(i) besides being the normal start-of-the-string and end-of-the-string meta characters.

Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • Uh, won't this match ` "asdf", "" ` (should not match) but not ` "asdf""asdf", "asdf" ` (should match)? – Lucero Oct 21 '09 at 16:19
  • @Lucero: no, exactly the opposite. It does NOT match the double quotes in `"asdf", ""`, and it DOES match the double quotes in `"asdf""asdf", "asdf"`. – Bart Kiers Oct 22 '09 at 06:47
  • How would the expression change if we were to find non-successive double quotes like "asdf "something" asdf ", "asdf", ... – stevenjmyu Jul 05 '10 at 15:33
  • 1
    This does not work in NodeJS regex engine. As NawaMan explains below, perhaps the solution really depends a lot on what engine you are using. – giacecco Feb 06 '14 at 09:09
2

Due to the complexity of your problem, the solution depends on the engine you are using. This because to solve it you must use look behind and look ahead and each engine is not the same one this.

My answer is using Ruby engine. The checking is just one RegEx but I out the whole code here for better explain it.

NOTE that, due to Ruby RegEx engine (or my knowledge), optional look ahead/behind is not possible. So I need a small problem of spaces before and after comma.

Here is my code:

orgTexts = [
    '"asdf","asdf"',
    '"", "asdf"',
    '"asdf", ""',
    '"adsf", "", "asdf"',
    '"asdf""asdf", "asdf"',
    '"asdf", """asdf"""',
    '"asdf", """"'
]

orgTexts.each{|orgText|
    # Preprocessing - Eliminate spaces before and after comma
    # Here is needed if you may have spaces before and after a valid comma
    orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')

    # Detect valid character (non-quote and valid quote)
    resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
    # resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
    # [^\"]       ===> A non qoute
    # |           ===> or
    # ^\"         ===> beginning quot
    # |           ===> or
    # \"$         ===> endding quot
    # |           ===> or
    # (?<=,)\"    ===> quot just after comma
    # \"(?=,)     ===> quot just before comma
    # (?<=\\\\)\" ===> escaped quot

    #  This part is to show the invalid non-escaped quots
    print orgText
    print resText.gsub(Regexp.new('"'), '^')

    # This part is to determine if there is non-escaped quotes
    # Here is the actual matching, use this one if you don't want to know which quote is un-escaped
    isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")*$/) != 0).to_s
    # Basicall, it match it from start to end (^...$) there is only a valid character

    print orgText + ": " + isMatch
    print 
    print ""
    print ""
} 

When executed the code prints:

"asdf","asdf"
-------------
"asdf","asdf": false


"","asdf"
---------
"","asdf": false


"asdf",""
---------
"asdf","": false


"adsf","","asdf"
----------------
"adsf","","asdf": false


"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true


"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true


"asdf",""""
--------^^-
"asdf","""": true

I hope I give you some idea here that you can use with other engine and language.

NawaMan
  • 25,129
  • 10
  • 51
  • 77
0
".*"(\n|(".*",)*)

should work, I guess...

aviraldg
  • 9,531
  • 6
  • 41
  • 56
  • 1
    Note that `.*` is greedy and may eat up any number of characters, including some `"`, so that for instance `"adsf", "", "asdf"`would also be a match. (And the performance may be pretty bad because it matches way to much even in the case of a non-match.) – Lucero Oct 21 '09 at 16:12
0

For single-line matches:

^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"

or for multi-line:

(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"

Edit/Note: Depending on the regex engine used, you could use lookbehinds and other stuff to make the regex leaner. But this should work in most regex engines just fine.

Lucero
  • 59,176
  • 9
  • 122
  • 152
  • Hi @Lucero, in case it was useful for you to know, that does not work on NodeJS regex engine: > var x = "\"field 1\", \"field 2 with \"something\" in it\",\"field 3\""; undefined > x.match(/^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"/) null – giacecco Feb 06 '14 at 09:14
  • @giacecco, the question here asked to look for two consecutive double quotes (`""`) within quoted strings, which is not what you have in your test string. – Lucero Feb 06 '14 at 12:45
  • you are right and I am sorry. I was confused by the fact that according to the CVS RFC, double-double-quotes are actually valid CSV: it's the way double-quotes are supposed to be escaped http://tools.ietf.org/html/rfc4180#section-2 . A similar issue brought me here, but not the original one Even Mien had. – giacecco Feb 06 '14 at 16:33
0

Try this regular expression:

"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"

That will match any quoted string with at least one pair of unescaped double quotes.

Gumbo
  • 643,351
  • 109
  • 780
  • 844