1

This is very similar to

Regular expression to find unescaped double quotes in CSV file

However, the solutions presented don't work with Node.js's regex engine. Given a CSV string where columns are quoted with double quotes, but some columns have unescaped double quotes in them, what regex could be used to match these unescaped quotes and just remove them.

Example rows

 "123","","SDFDS   SDFSDF EEE "S"","asdfas","b","lll"
 "123","","SDFDS   SDFSDF EEE "S"","asdfas","b","lll"

So the two double quotes surrounding the S in the third column would get matched and removed. Needs to work in Node.js (14.16.1)

I have tried (?m)""(?![ \t]*(,|$)) but get a Invalid regular expression: /(?m)""(?![ \t]*(,|$))/: Invalid group exception

Nishant S Vispute
  • 713
  • 1
  • 7
  • 21
Display name
  • 1,109
  • 1
  • 15
  • 31
  • 2
    What solution does not work in JavaScript? [There are](https://stackoverflow.com/questions/1601780/regular-expression-to-find-unescaped-double-quotes-in-csv-file) quite a few there. – Wiktor Stribiżew May 08 '21 at 20:18
  • Please show us the code you tried. The relevant part of it. Thanks. – line-o May 08 '21 at 20:19
  • As I originally noted, the solutions there don't work in Node.js. I have updated my question with the exception. – Display name May 08 '21 at 20:52

1 Answers1

0

I don't know much about node.js, but assuming it is like the JavaScript flavor of regex then I have the following comments about the example you took from the prior answer:

  1. I think your example is choking on the first element, (?m) which is unsupported in Javascript. However, that part is not essential to your task. It only turns on multiline processing and you don't need that if you feed the regex engine each line individually. If you find you still want to feed it a multiline string, then you can still turn on multiline in JavaScript - you do it with the "m" flag after the final delimiter, "/myregex/m". All of the other elements, including the negative lookahead are supported by JavaScript and probably by your engine as well. So, drop the (?m) part of your expression and try it again.
  2. Even after you get it to work, the example row you provided will not be parsed according to your expectations by the sample regular expression. Its function is to identify all occurrences of two double-quotes that are not followed by a comma (or end of string). The ONLY two occurrences of doubled quotes in your example each have a comma after, so you will get no matches on this regex in your example.
  3. It seems like you want some context-sensitive scanning to match and remove the inner pairs of double quotes while leaving the outer ones in place and handling commas inside your strings and possibly correctly quoted double quotes. Regular expression engines are really bad at this kind of processing and I don't think you are going to get satisfactory results whatever you come up with.
  4. You can get an approximate solution to your problem by using regex once to parse the individual elements of the .csv stripping the outer quotes as you go and then running a second regex against each parsed element to either remove single occurrences of double quote or adding a second double-quote, where necessary. Then you can reassemble the string under program control.
    This still will break if someone embeds a "", sequence in a data field string, so it's not perfect but it might be good enough for you.

The regex for splitting the .csv and stripping the double quotes is:

/(("(.*?)")|([^,]*))(,|$)/gm

This will accept either a "anything", OR a anything, repeatedly until the source is exhausted. Because of the capturing groups, the parsed text will either by in $3 (if the field was quoted) or $4 (if it was not quoted) but not both. Here's a regexpReplace of your string with $3&$4 and a semicolon after each iteration (I took the liberty of adding a numeric field without the quotes so you could see that it handles both cases):

"123","","SDFDS   SDFSDF EEE "S"",456,"asdfas","b","lll"

RegexpReplace(<above>,"((""(.*?)"")|([^,]*))(,|$)","$3$4;")
=>  123;;SDFDS   SDFSDF EEE "S";456;asdfas;b;lll;;

See how the outer quotes have been stripped away. Now it's a simple thing to go through all the matches to remove all the remaining quotes, and then you can reconstruct the string from the array of matches.

Chris Maurer
  • 2,339
  • 1
  • 9
  • 8