7

I have to read a .csv file that has three columns. While parsing it, I get each row as a string in this format: Christopher Bass,\"Cry the Beloved Country Final Essay\",cbass@cgs.k12.va.us. I want to store the values of the three columns in an array, so I used the componentsSeparatedByString:@"," method. It successfully returns an array with three components:

  1. Christopher Bass
  2. Cry the Beloved Country Final Essay
  3. cbass@cgs.k12.va.us

But when a column value already contains a comma, like this: Christopher Bass,\"Cry, the Beloved Country Final Essay\",cbass@cgs.k12.va.us, it separates the string into four components, because there is a comma after Cry:

  1. Christopher Bass
  2. Cry
  3. the Beloved Country Final Essay
  4. cbass@cgs.k12.va.us

So how can I handle this using a regular expression? I have the "RegexKitLite" classes, but which regular expression should I use? Please help!

Thanks-

Alan Moore
Developer
  • Does it need to be a regexp, or would a "low-tech" solution be acceptable? – Sergey Kalinichenko Jan 31 '12 at 17:01
  • @dasblinkenlight: If you have alternate solution then I will appreciate that also. – Developer Jan 31 '12 at 17:02
  • hey could you send me your csv file???? – Inder Kumar Rathore Feb 03 '12 at 04:15
  • I think you are missing something.. your csv string should be like this `\"Christopher Bass\",\"Cry, the Beloved Country Final Essay\",\"cbass@cgs.k12.va.us\"` – Inder Kumar Rathore Feb 03 '12 at 04:16
  • @Inder Kumar Rathore Yes, I can send it to you, but how? As for your second comment: no, I am not missing anything. The CSV string should look the way you wrote it, but when I parse a CSV created on Windows, it comes out the way I showed above. Otherwise I would not have had any problem: I could separate the string by "\",\"" and it would work perfectly, but that is not what happens! – Developer Feb 03 '12 at 09:35
  • Regular expressions should never be used to parse CSVs. Parsing all possible CSVs correctly is impossible with a regex. – Toad Aug 24 '12 at 08:17

5 Answers

2

Any regular expression would probably run into the same problem. What you need is to sanitize your entries, either by escaping the commas or by quoting each string, like this: "My string". Otherwise you will keep hitting the same problem. Good luck.

For your example you would probably need to do something like:

\"Christopher Bass\",\"Cry\, the Beloved Country Final Essay\",\"cbass@cgs.k12.va.us\"

That way you could use a regexp or even the same method from the NSString class.
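If you control how the file is written, the quoting option can be sketched as follows. This is only an illustration: it uses the common CSV convention of wrapping every field in quotes and doubling any embedded quote (rather than the backslash-escaped comma shown above), and the helper names QuoteCSVField and MakeCSVRow are made up for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: quote one field so embedded commas and quotes survive.
static NSString *QuoteCSVField(NSString *field) {
    // Double any embedded quotes, then wrap the whole field in quotes.
    NSString *escaped = [field stringByReplacingOccurrencesOfString:@"\""
                                                         withString:@"\"\""];
    return [NSString stringWithFormat:@"\"%@\"", escaped];
}

// Hypothetical helper: build one CSV row from an array of NSString fields.
static NSString *MakeCSVRow(NSArray *fields) {
    NSMutableArray *quoted = [NSMutableArray arrayWithCapacity:[fields count]];
    for (NSString *field in fields) {
        [quoted addObject:QuoteCSVField(field)];
    }
    return [quoted componentsJoinedByString:@","];
}
```

A reader of such a file has to understand the same convention; splitting on "\",\"" only works as long as that exact sequence never occurs inside a field.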

Not related at all, but the importance of sanitizing strings: http://xkcd.com/327/ hehehe.

El Developer
  • What if there are 10,000 users who are entering data into the database using the app? – Developer Jan 31 '12 at 17:10
  • What do you mean by "what"? In order to understand this question, I would need some context and background about this. – El Developer Feb 02 '12 at 00:22
  • In your answer you have dropped the comma after Cry, and I think you are saying that there should be no commas or quotes in the content of the data! How can I control this if there are more than 10,000 users? – Developer Feb 06 '12 at 10:30
  • Whoops, completely missed that one. If those are new entries to the database of the CSV file, you can "escape" the commas that are characters in the strings and use the unescaped commas of the file to separate the columns. If you already have a CSV file with 10K users, well, then I don't think this solution will work for you :S sorry. – El Developer Feb 07 '12 at 03:49
1

How about this:

componentsSeparatedByRegex:@",\\\"|\\\","

This should split your string wherever " and , appear together in either order, resulting in a three-member array. This of course assumes that the second element in the string is always enclosed in quotation marks, and that the characters " and , never appear consecutively within the three components.

If either of these assumptions is incorrect, other methods to identify string components may be used, but it should be made clear that no generic solution exists. If the three component strings can contain " and , anywhere, not even a limited solution is possible in such cases:

Doe, John,\"\"Why Unescaped Strings Suck\", And Other Development Horror Stories\",Doe, John <john.doe@dev.null>

Hopefully there is nothing like the above in your CSV data. If there is, the data is basically unusable, and you should look into a better CSV exporter.

Feysal
0

The regex you're searching for is: \\"(.*)\\"[ ^,]*|([^,]*),

In pseudocode: (('\"' && string_1 && '\"' && 0-n spaces) || string_2 except comma) && comma

NSString *str = @"Christopher Bass,\"Cry, the Beloved Country ,Final Essay\",cbass@cgs.k12.va.us,som";
NSString *regEx = @"\\\"(.*)\\\"[ ^,]*|([^,]*),";
NSMutableArray *split = [[str componentsSeparatedByRegex:regEx] mutableCopy];
[split removeObject:@""]; // because it will print always both groups even if the other is empty
NSLog(@"%@", split);

// OUTPUT:
2012-02-07 17:42:18.778 tmpapp[92170:c03] (
    "Christopher Bass",
    "Cry, the Beloved Country ,Final Essay",
    "cbass@cgs.k12.va.us",
    som
)

RegexKitLite adds both capture groups to the array, so you end up with empty objects in it. removeObject:@"" deletes those, but if you need to preserve genuinely empty values (e.g. your source has val,,ue), change the code to the following:

str = [str stringByReplacingOccurrencesOfRegex:regEx withString:@"$1$2∏"];
NSArray *split = [str componentsSeparatedByString:@"∏"];

$1 and $2 are the two capture groups of the regex; ∏ is in this case a character which will most likely never appear in normal text (and is easy to remember: option-shift-p).
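If a regex is not a hard requirement, an alternative is a small scanner that walks the line once and tracks whether it is currently inside quotes. This is a different technique from the regex above, shown only as a rough sketch: ParseCSVLine is a hypothetical helper name, and it assumes the quotes in the data are plain " characters and that a doubled "" inside a quoted field stands for one literal quote.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: split one CSV line, treating commas inside quotes as text.
static NSArray *ParseCSVLine(NSString *line) {
    NSMutableArray *fields = [NSMutableArray array];
    NSMutableString *current = [NSMutableString string];
    BOOL inQuotes = NO;
    NSUInteger length = [line length];

    for (NSUInteger i = 0; i < length; i++) {
        unichar c = [line characterAtIndex:i];
        if (c == '"') {
            if (inQuotes && i + 1 < length && [line characterAtIndex:i + 1] == '"') {
                // Assumption: a doubled quote inside a quoted field is one literal quote.
                [current appendString:@"\""];
                i++;
            } else {
                // Otherwise the quote opens or closes a quoted section.
                inQuotes = !inQuotes;
            }
        } else if (c == ',' && !inQuotes) {
            // An unquoted comma ends the current field (which may be empty).
            [fields addObject:[NSString stringWithString:current]];
            [current setString:@""];
        } else {
            [current appendString:[NSString stringWithCharacters:&c length:1]];
        }
    }
    // The last field has no trailing comma, so add it after the loop.
    [fields addObject:[NSString stringWithString:current]];
    return fields;
}
```

Because empty fields are added like any other, a source such as val,,ue keeps its empty middle value without needing a placeholder character.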

0

Is the title guaranteed to have the quotation marks? And is it the only component that can have them? Because then componentsSeparatedByString:@"\"" should get you this:

  1. Christopher Bass,
  2. Cry, the Beloved Country Final Essay
  3. ,cbass@cgs.k12.va.us

Then use componentsSeparatedByString:@"," or substringFromIndex:/substringToIndex: to get rid of the two commas in the first and last components.

Here's a solution using substring:

NSString* input = @"Christopher Bass,\"Cry, the Beloved Country Final Essay\",cbass@cgs.k12.va.us";
NSArray* split = [input componentsSeparatedByString:@"\""];
NSString* part1 = [split objectAtIndex:0];
NSString* part2 = [split objectAtIndex:1];
NSString* part3 = [split objectAtIndex:2];
part1 = [part1 substringToIndex:[part1 length] - 1];
part3 = [part3 substringFromIndex:1];

// Pass the strings as arguments, not as the format string itself.
NSLog(@"%@", part1);
NSLog(@"%@", part2);
NSLog(@"%@", part3);
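The same idea can be extended to rows with any number of quoted columns. The following is only a sketch under the assumption that quotes are always balanced and never appear inside a value; SplitOnQuotesThenCommas is a made-up helper name. After splitting on the quote character, the odd-numbered segments were inside quotes and are kept whole, while the even-numbered segments were outside quotes and are split on commas.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: generalizes the quote-split idea to any number of fields.
static NSArray *SplitOnQuotesThenCommas(NSString *line) {
    NSArray *segments = [line componentsSeparatedByString:@"\""];
    NSMutableArray *fields = [NSMutableArray array];
    NSMutableString *current = [NSMutableString string];

    for (NSUInteger i = 0; i < [segments count]; i++) {
        NSString *segment = [segments objectAtIndex:i];
        if (i % 2 == 1) {
            // Odd segments sat between quotes: keep their commas as text.
            [current appendString:segment];
            continue;
        }
        // Even segments sat outside quotes: their commas separate fields.
        NSArray *pieces = [segment componentsSeparatedByString:@","];
        [current appendString:[pieces objectAtIndex:0]];
        for (NSUInteger j = 1; j < [pieces count]; j++) {
            [fields addObject:[NSString stringWithString:current]];
            [current setString:[pieces objectAtIndex:j]];
        }
    }
    // Whatever is left after the last separator is the final field.
    [fields addObject:[NSString stringWithString:current]];
    return fields;
}
```

Unlike the fixed three-part version, this does not need to know which column is quoted, but it still breaks if a value itself contains a quote character.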
Martin Gjaldbaek
  • If I use componentsSeparatedByString:@"," then the text at position 2 will be broken into two parts. How will I then track which row those two separated parts belong to? – Developer Feb 09 '12 at 15:50
  • Please re-read the answer - the idea is to not use componentsSeparatedByString:@"," on the text at position 2 at all. Use componentsSeparatedByString:@"\"" to break it using the quotes instead of the commas. Then get rid of the commas (for this, componentsSeparatedByString:@"," could be used, but of course only on components 1 and 3 of the first split). – Martin Gjaldbaek Feb 10 '12 at 12:16
  • I've updated the answer with a solution that uses substring to get rid of the commas (to avoid confusion). – Martin Gjaldbaek Feb 10 '12 at 12:50
0

The last part looks like it will never contain a comma. Neither will the first one as far as I can see...

What about splitting the string like this:

NSArray *splitArr = [str componentsSeparatedByString:@","];
NSString *nameStr = [splitArr objectAtIndex:0];
NSString *emailStr = [splitArr lastObject];

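NSString *contentStr = @"";
if ([splitArr count] > 2) {
    // Rejoin the middle pieces, restoring the commas that the split removed.
    NSRange middle = NSMakeRange(1, [splitArr count] - 2);
    contentStr = [[splitArr subarrayWithRange:middle] componentsJoinedByString:@","];
}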

This will use the first and last string as is, and combine the rest into the content.

Kind of a hack, but a name and an email address will never contain a comma, right?