
I am trying to find a regular expression that will not match a delimiter if it is wrapped in double quotes. But it must also be able to handle values that have a single double quote. I have the first part down with the below expression where DELIMITER could be just about anything but is mainly commas, pipes, and double pipes:

DELIMITER(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)

This handles a properly formed CSV row like apple, "banana, and orange", grape. I can split on the delimiter and get the values:

['apple', 'banana, and orange', 'grape']

My problem is that I may encounter a line like apple, "banana, and orange, grape. In this case I would want to get the values:

['apple', '"banana', 'and orange', 'grape']

However, I get:

['apple, "banana', 'and orange', 'grape']

It basically ignores all of the commas up to the double quote.
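To make the behaviour concrete, here is a minimal sketch that reproduces it. It assumes std::regex (ECMAScript grammar) as a stand-in for QRegExp, which may differ in details, and the function name split_outside_quotes is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `line` on `delimiter`, skipping any delimiter
// that is followed by an odd number of double quotes.
// Assumption: `delimiter` is already regex-safe ("," is; "|" would need
// escaping as "\\|").
std::vector<std::string> split_outside_quotes(const std::string& line,
                                              const std::string& delimiter = ",") {
    // Same lookahead as in the expression above: the delimiter only matches
    // when the rest of the line holds an even number of double quotes.
    const std::regex re(delimiter + R"re((?=(?:[^"]*"[^"]*")*[^"]*$))re");

    // Submatch index -1 asks the iterator for the text between matches.
    std::sregex_token_iterator it(line.begin(), line.end(), re, -1);
    std::sregex_token_iterator end;
    return std::vector<std::string>(it, end);
}
```

The pieces come back untrimmed and with their quotes still attached, so the well-formed row needs a clean-up pass afterwards. For the malformed row, the first piece is apple, "banana: the comma after apple is rejected because exactly one double quote follows it.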

The logic that I have in my head is that I want to ignore a comma that comes after a double quote, but only if a closing double quote follows it as well. My first thought was to play around with a look-behind, but I can't get that to work because look-behinds are not able to handle quantifiers (correct me if this is wrong).

I am using Qt QRegExp which I understand is more or less similar to the Perl regex engine. Please let me know if there is more information that I can provide. I know regular expressions can be finicky based on your setup, and I hope I have explained what I'm looking for well enough!

  • What would you be looking for in the case of: apple, "banana,orange, grape,"peach, cherry, lemon"? Why not validate against mismatched quotes and make the user fix their inputs? – RegularlyScheduledProgramming Feb 25 '16 at 18:45
  • I would expect it to return `['banana,orange,grape,"peach,cherry,lemon']`. I am leaning towards just skipping over the row and letting the upstream system know about the bad data, but now I am just curious to see if this is even possible. Either this post will die, someone will let me know it isn't currently possible with regex only, or this question is going to produce one awesome expression! – rgrwatson85 Feb 26 '16 at 01:57

1 Answer


It's not Qt, but boost::tokenizer, which is header-only, supports escaped, delimited text formats.

From the example usage at the Boost docs: http://www.boost.org/doc/libs/1_60_0/libs/tokenizer/escaped_list_separator.htm

// simple_example_2.cpp
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main() {
    using namespace std;
    using namespace boost;
    // A quoted field may contain the separator; the defaults are ',' and '"'.
    string s = "Field 1,\"putting quotes around fields, allows commas\",Field 3";
    tokenizer<escaped_list_separator<char> > tok(s);
    // Print one token per line.
    for (tokenizer<escaped_list_separator<char> >::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        cout << *beg << "\n";
    }
}

In the malformed case tok returns a single token, which isn't what you're looking for. Since you're looking for non-standard [1] parsing, consider writing a small state machine instead of a regular expression.

[1] As much as there is a standard for delimited text.
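As a rough illustration of the state-machine idea, here is a minimal sketch in plain C++ with no Boost or Qt dependency. It rests on assumptions of my own, not on any standard: a double quote opens a quoted run only if another double quote appears later in the line, otherwise it is kept as a literal character; fields are trimmed of surrounding spaces and tabs; and the names split_lenient and trim are made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Strip leading and trailing spaces/tabs from a field.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split `line` on `delim`, honouring double-quoted
// sections, but treating a quote with no later partner as literal text.
std::vector<std::string> split_lenient(const std::string& line,
                                       const std::string& delim = ",") {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    std::size_t i = 0;
    while (i < line.size()) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') in_quotes = false;   // closing quote, dropped
            else current += c;                 // delimiters are plain text here
            ++i;
        } else if (c == '"') {
            if (line.find('"', i + 1) != std::string::npos)
                in_quotes = true;              // a partner exists: open a quoted run
            else
                current += c;                  // unmatched quote: keep it literally
            ++i;
        } else if (!delim.empty() && line.compare(i, delim.size(), delim) == 0) {
            fields.push_back(trim(current));   // delimiter outside quotes ends the field
            current.clear();
            i += delim.size();
        } else {
            current += c;
            ++i;
        }
    }
    fields.push_back(trim(current));           // last field
    return fields;
}
```

With these assumptions, apple, "banana, and orange", grape yields apple / banana, and orange / grape, and the malformed apple, "banana, and orange, grape yields apple / "banana / and orange / grape. Quotes are paired from left to right, so a line with three quotes may not be split the way you would want; that case probably still deserves to be rejected or logged.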

  • I do not have boost available. I'm unsure that this would work correctly for the **malformed** example. What would the output be for `apple, "banana, and orange, grape`? I would want the double quote in front of banana to be taken literally, so when it gets put into the database it is `"banana`. – rgrwatson85 Feb 25 '16 at 18:24