
I am trying to remove the commas that appear inside the fields of a CSV line.

Here is what I expect:

    // (1) ",123,456,"     -> ",123456,"
    // (2) ","abc,def","   -> ","abcdef","
    // (3) ","123,456","   -> ","123456","
    // (4) ","abcdef,","   -> ","abcdef","

I wrote the following code

    String[] test = {
        "\",123,456,\"",
        "\",\"abc,def\",\"",
        "\",\"123,456\",\"",
        "\",\"abcdef,\",\""            
    };

    final Pattern commaNotBetweenQuotes = Pattern.compile("(?<!\"),(?!\")");

    for (String d : test) {
        System.out.println("O : " + d);
        String result = commaNotBetweenQuotes.matcher(d).replaceAll("");
        System.out.println("R : " + result);
    }

However, case (4) fails.

Here is the output I get

    O : ",123,456,"
    R : ",123456,"

    O : ","abc,def","
    R : ","abcdef","

    O : ","123,456","
    R : ","123456","

    O : ","abcdef,","
    R : ","abcdef,","   <-- expected the comma after "f" to be removed,
                            since it is inside a quoted string

May I know how I can further improve this regular expression pattern?

    final Pattern commaNotBetweenQuotes = Pattern.compile("(?<!\"),(?!\")");

I got the code from the question "Different regular expression result in Java SE and Android platform".

My understanding of the pattern is:

If a comma doesn't have double quote on its left AND on its right, replace it with empty string.

I tried

     final Pattern commaNotBetweenQuotes = Pattern.compile("(?<!\"),(?!\")|(?<![\"0-9]),(?=\")");

with the idea:

If a comma doesn't have double quote on its left AND on its right, replace it with empty string.

OR

If a comma has double quote on its right, and non-digit / non double quote on its left, replace it with empty string.

However, this "solution" is not elegant. What I really want is to remove commas within string literals, remove commas within integers, and retain the commas used as CSV separators.

Please avoid using $1, as Android substitutes "null" instead of "" for an unmatched group.
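For reference, the two-alternative pattern described above can be checked against the four cases with a small sketch like the one below. The class name `TwoAlternativeCheck` and the helper `strip` are illustrative names only, not part of the original code. It also includes one extra, hypothetical input (`","abc1,","`) to show where the second alternative stops helping: a string literal that ends in a digit keeps its trailing comma, because the lookbehind excludes digits.

```java
import java.util.regex.Pattern;

public class TwoAlternativeCheck {
    // Pattern copied from the question: either no quote on both sides of the comma,
    // or a quote on the right and neither a quote nor a digit on the left.
    static final Pattern COMMA =
            Pattern.compile("(?<!\"),(?!\")|(?<![\"0-9]),(?=\")");

    // Illustrative helper: removes every comma the pattern matches.
    static String strip(String line) {
        return COMMA.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        String[] test = {
            "\",123,456,\"",
            "\",\"abc,def\",\"",
            "\",\"123,456\",\"",
            "\",\"abcdef,\",\"",
            "\",\"abc1,\",\""   // hypothetical extra case: literal ending in a digit
        };
        for (String d : test) {
            System.out.println("O : " + d);
            System.out.println("R : " + strip(d));
        }
    }
}
```

No `$1` back reference is involved, so the Android "null" substitution does not come into play here; the cost is that the pattern cannot tell a digit at the end of a string literal from the end of an integer.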

Cheok Yan Cheng
  • what is wrong with your results – aaronman Jun 14 '13 at 02:26
  • Honestly, using regex to parse CSV is a lot of trouble. Try using opencsv. http://opencsv.sourceforge.net/ – austin Jun 14 '13 at 02:40
  • No. I'm actually already using opencsv to perform the parsing. The step above is a pre-filter for some cases that opencsv cannot handle, for example: "this is string", 123, "this is another, string". There are in fact only 3 elements in the line, but opencsv treats them as 4. – Cheok Yan Cheng Jun 14 '13 at 02:45
  • I don't get your CSV pattern. Seems you arbitrarily know how to parse the string. The case (1) is `",123,456,"` is a quoted string with commas inside, but you let the result be: `",123456,"`. You removed the comma between the two outer commas. But in the case (4) you want to remove the comma in the string `"abcdef,"`. – Sebastian Jun 14 '13 at 02:56
  • It is a bit confusing. In fact, our CSV input has 2 types of elements: one is a string literal, the other is an integer. For example, "string", 123,456, "string" should really be filtered to "string", 123456, "string" – Cheok Yan Cheng Jun 14 '13 at 02:58
  • for case 4, since it is "comma in a string literal", not "comma in integer", we want to remove it for easy processing – Cheok Yan Cheng Jun 14 '13 at 02:59
  • Your case 3 is failing as well. – acdcjunior Jun 14 '13 at 03:32
  • The pattern `If a comma doesn't have double quote on its left AND on its right, replace it with empty string.` is correct. The case 4 is not replaced because it does not have a double quote on the left side. And the pattern says **AND**, so you have to have both. – acdcjunior Jun 14 '13 at 03:37
  • @acdcjunior is right, case (4) `","abcdef,","` is wrong. First, it starts with one quote. That's not an empty string nor a literal. If you are passing a partial part of a CSV, make sure that you have complete fields. If that is a partial CSV string, then all your tests will lead you to undefined behaviours. – Sebastian Jun 14 '13 at 04:01

2 Answers


Description

To remove all the commas stuck in the middle of the fields, use the following. The empty capture group (\b) should avoid the problem on Android where, if the back reference $# is not matched, the platform inserts "null" instead of nothing:

Regex: ((?:",\d|\d,")|",")|(\b),

Replace with: $1


Input

",123,456," 
","abc,def","
","123,456"," 
","abcdef,","

Output

",123456," 
","abcdef","
","123456"," 
","abcdef","

Disclaimer

This assumes the commas you want to keep are all surrounded by quotes like "alpha","beta","1234"
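If the `$1` replacement string is the concern, one possible workaround is to build the replacement in code with `Matcher.appendReplacement` and substitute an empty string whenever group 1 did not participate in the match. This is a minimal sketch under that assumption, not a tested Android fix; the class name `CommaFilter` and the method `strip` are made up for illustration, and the regex is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommaFilter {
    // Regex from this answer: group 1 captures the separators to keep,
    // the second alternative matches a comma that follows a word character.
    private static final Pattern COMMA =
            Pattern.compile("((?:\",\\d|\\d,\")|\",\")|(\\b),");

    // Illustrative helper: re-emits group 1 when it matched, otherwise emits
    // nothing, so no "$1" replacement string is ever handed to the regex engine.
    static String strip(String line) {
        Matcher m = COMMA.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String keep = m.group(1);               // null when group 1 did not match
            String replacement = (keep == null) ? "" : keep;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String[] input = {
            "\",123,456,\"",
            "\",\"abc,def\",\"",
            "\",\"123,456\",\"",
            "\",\"abcdef,\",\""
        };
        for (String line : input) {
            System.out.println(strip(line));
        }
    }
}
```

The explicit `null` check is the point of the sketch: the decision about what an unmatched group turns into is made in the calling code rather than left to the platform's handling of `$1`.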

Ro Yo Mi
  • Thanks. However, using $1 will yield problem in Android platform as you can see in my link. That's why I avoid them :) – Cheok Yan Cheng Jun 14 '13 at 03:32
  • See article http://stackoverflow.com/questions/12377838/android-replace-with-regex for how to use replace with $# format on the android platform – Ro Yo Mi Jun 14 '13 at 03:38
  • The link is not relevant, as the "problem" in Android is that, for unmatched group, they are replacing it with "null" instead of "" – Cheok Yan Cheng Jun 14 '13 at 03:46
  • @Denomales The first result does not yield what the OP wants. For input `",123,456,"` the OP wants the output: `",123456,"`. – Sebastian Jun 14 '13 at 03:55

You can also find the second occurrence of , in your String and then replace it with "". Here are some examples:

Marek