In my program, I'm running grep via NSTask. Sometimes I get no results, even though the same command works just fine when run from the CLI. I went through my code and found, in Apple's documentation, that when arguments are added to an NSTask object, "the NSTask object converts both path and the strings in arguments to appropriate C-style strings (using fileSystemRepresentation) before passing them to the task via argv[]" (snip).

The problem is that I might grep terms like "Río Gallegos". Sadly (as I checked with fileSystemRepresentation), that undergoes the conversion and turns out to be "RiÃÅo Gallegos".

How can I solve this?

-- Ry

ryyst

1 Answer

> The problem is that I might grep terms like "Río Gallegos". Sadly (as I checked with fileSystemRepresentation), that undergoes the conversion and turns out to be "RiÃÅo Gallegos".

That's one possible interpretation. What actually happens is that “Río Gallegos” gets converted to “Ri\xcc\x81o Gallegos”: the UTF-8 bytes that represent the decomposed form, i followed by a combining acute accent (U+0301).
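
To make the byte-level picture concrete, here is a small sketch in Python (chosen only because it is short to run; it is not Cocoa code). It assumes that canonical decomposition (NFD) is a reasonable stand-in for what fileSystemRepresentation does to this particular string, and the variable names are purely illustrative. It prints the UTF-8 bytes and then shows what the same bytes look like when they are misread as MacRoman.

```python
import unicodedata

# "Río Gallegos" written with a precomposed í (U+00ED).
term = "R\u00edo Gallegos"

# Assumption: NFD approximates the decomposition applied by
# fileSystemRepresentation for this string.
decomposed = unicodedata.normalize("NFD", term)
utf8_bytes = decomposed.encode("utf-8")

# The bytes themselves: i, then U+0301 encoded as 0xCC 0x81.
print(utf8_bytes)                      # b'Ri\xcc\x81o Gallegos'

# The same bytes decoded with the wrong encoding (MacRoman).
print(utf8_bytes.decode("mac_roman"))  # RiÃÅo Gallegos
```

The bytes are the same in both print statements; only the encoding used to display them differs.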

Your problem is that grep is not interpreting these bytes as UTF-8. grep is using some other encoding—apparently, MacRoman.

The solution is to tell grep to use UTF-8. That requires setting the LC_ALL variable in your grep task's environment.

The quick and dirty value to use would be “en_US.UTF-8”; a more proper way would be to get the language code for the user's primary preferred language, replace the hyphen, if any, with an underscore, and stick “.UTF-8” on the end of that.
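
As a rough sketch of that recipe, the following Python builds the locale name and an environment dictionary containing LC_ALL. The helper names `utf8_locale` and `grep_environment` are hypothetical, and nothing here checks that the resulting locale is actually installed on the system. In Cocoa the equivalent dictionary would presumably be handed to the task with NSTask's `setEnvironment:` before launching.

```python
import os

def utf8_locale(language_tag):
    """Hypothetical helper: 'en-US' -> 'en_US.UTF-8', 'es' -> 'es.UTF-8'."""
    return language_tag.replace("-", "_") + ".UTF-8"

def grep_environment(language_tag, base_env=None):
    """Copy an environment and set LC_ALL for the child process.

    Copying first keeps PATH and the other inherited variables intact
    instead of replacing the whole environment with a single entry.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LC_ALL"] = utf8_locale(language_tag)
    return env

print(utf8_locale("en-US"))  # en_US.UTF-8
print(grep_environment("es-AR", base_env={"PATH": "/usr/bin"}))
```

Whether a bare language code such as “es” yields a usable locale name is an open question here; a tag that includes a region (“es-AR”) is the safer input.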

Peter Hosey
  • Thanks for the answer, but it doesn't work... I also tried setting the LC_CTYPE and LANG variable in the grep task's environment, but still no luck. – ryyst Mar 29 '10 at 13:55
  • How did you determine that grep is interpreting the bytes the way you showed in your question? – Peter Hosey Mar 29 '10 at 14:26
  • Via NSString's fileSystemRepresentation method and NSLog() statements. Experimenting showed that only strings without "non-standard" characters such as 'í' work. I see that this is no proof, but it's strong evidence. – ryyst Mar 29 '10 at 14:45
  • And how are you viewing the NSLog output? – Peter Hosey Mar 29 '10 at 15:20
  • With the debugger console in Xcode. – ryyst Mar 29 '10 at 15:47
  • OK, then. Try this: `NSLog(@"%@ = %lu bytes", myString, (unsigned long)strlen([myString fileSystemRepresentation]));` What does that log? – Peter Hosey Mar 29 '10 at 15:54
  • If myString is "Río Gallegos" (without quotes), the output is: `Río Gallegos = 14 bytes` – ryyst Mar 29 '10 at 16:39
  • By the way, if I try `NSLog(@"%s = %lu bytes", "Río Gallegos", (unsigned long)strlen("Río Gallegos"));` it logs: `R√≠o Gallegos = 13 bytes` – ryyst Mar 29 '10 at 16:52
  • ryyst: Interesting. That output seems right, so grep is indeed misinterpreting it. Either that, or the target text really doesn't contain the pattern. (Perhaps the target text is not UTF-8, or uses a different normalization form? I don't think grep really understands encodings or Unicode.) – Peter Hosey Mar 30 '10 at 08:20
  • Well, running "grep "Río Gallegos" " does show results, so I guess it really is an encoding problem. This little snippet (http://pastebin.org/128441) shows that strings encoded with fileSystemRepresentation are actually extremely limited. I'm thinking about just using system() calls, NSTask is really annoying me. – ryyst Mar 30 '10 at 08:49
  • ryyst: Um. Well, that code explains it. First, neither NSTask nor grep is what's interpreting the bytes as MacRoman; NSString is. And it's doing that because *you told it to*. So, don't do that. The bytes are UTF-8, so interpret them as such. (Also, how are you getting “Río Gallegos” from a pointer to a `char` variable into which you've assigned an `int`?) – Peter Hosey Mar 30 '10 at 10:35
  • Okay, but even if I messed up with encodings in the snippet, that doesn't really explain the problem I'm experiencing with NSTask, does it? This (http://lists.apple.com/archives/Cocoa-dev/2007/Apr/msg01324.html) might also be an interesting read, as it covers exactly my problem. However, they all just say that it should just work OOB, which it obviously doesn't in my case... By the way, I can post all the code related to NSTask, if that simplifies matters for you. And thanks for all your efforts! – ryyst Mar 30 '10 at 11:52
  • I think we need to see where this mystical “s” is really coming from. Assigning an `int` to a `char` variable does not make a valid string in any encoding. – Peter Hosey Mar 30 '10 at 12:06
  • Hm, I always thought that was the C way of listing characters - apparently I'm wrong? Anyway, how are these two problems related to each other? – ryyst Mar 30 '10 at 14:32
  • Well, the code you're showing is fantastically unlikely to produce the result you've claimed. You've declared a variable holding *a* `char`—just one, not an array of them. Then, in a loop, you assigned an `int` into this variable; the first time through the loop, it is zero. Then you take the address of this `char`, and treat it as a C string; the character at the pointer being zero, this is an empty C string. The second time through the loop, the character at the pointer is 1 and any characters thereafter could be anything—you'll get random garbage, almost certainly not a person's name. – Peter Hosey Mar 30 '10 at 15:00
  • Wild theory: Did you mean to represent the integer as decimal digits? Assigning to a `char` variable (or an array of `char`) won't do that; the only “conversion” there is lopping off the bits that won't fit. Assigning 1, say, to a `char` variable will put a 1 (as in, 0x01, not `'1'`, which is 0x31) byte in it. If converting the number to a decimal representation in a string is what you meant to do, then use NSString's `stringWithFormat:`. – Peter Hosey Mar 30 '10 at 15:06
  • No, I didn't try to create useful strings or anything, I just wanted to list some characters and see what fileSystemRepresentation would do with them. What I now tried is putting a line with "RiÃÅo Gallegos" inside my text file. When passing "Río Gallegos" as argument to my NSTask object, it now indeed finds the line I added – grep is apparently really misinterpreting the argument. I still don't know how to keep either NSTask or grep from doing what they do now ... – ryyst Mar 30 '10 at 16:51
  • What do you mean “grep is… misinterpreting the argument”? If it finds the string, doesn't that mean it interpreted it correctly? And why would you put “RiÃÅo Gallegos” into the text file? You should put in “Río Gallegos” as UTF-8, then load the data from the file, decode the data as UTF-8 to get a string ( http://developer.apple.com/mac/library/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/initWithData:encoding: ), and pass that string to the task. – Peter Hosey Mar 30 '10 at 22:43
  • Oh, and don't use TextEdit to edit the file; it's dumb about UTF-8. Use TextWrangler instead: barebones.com/products/textwrangler. Another way would be to put the string in your Info.plist. However, assuming that you're not going to get the real string that you'll use in your shipping app from either of these sources, you'll have to fix wherever you're really creating the string. – Peter Hosey Mar 30 '10 at 22:45
  • It's not a problem with the text file, nor myself messing up with NSStrings. It's NSTask doing a conversion it shouldn't and grep not understanding the arguments right anymore. I'm not sure if the problem can be solved at all. – ryyst Mar 31 '10 at 11:07
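
One hypothesis raised in the comments is that the pattern and the target text use different normalization forms. The Python sketch below illustrates that scenario only; it does not establish that this is what happened in the thread. It assumes the file stores precomposed (NFC) text while the argument arrives decomposed (NFD), and the sample line "Puerto de Río Gallegos" is made up for the example.

```python
import unicodedata

term = "R\u00edo Gallegos"  # "Río Gallegos" with precomposed í

# Assumed situation: the argument is decomposed, the file content is not.
pattern_nfd = unicodedata.normalize("NFD", term).encode("utf-8")
pattern_nfc = unicodedata.normalize("NFC", term).encode("utf-8")
line_nfc = unicodedata.normalize("NFC", "Puerto de " + term).encode("utf-8")

# The two forms differ in length, matching the 14 and 13 bytes logged above.
print(len(pattern_nfd), len(pattern_nfc))  # 14 13

# A byte-wise search with the decomposed pattern misses the precomposed text.
print(pattern_nfd in line_nfc)  # False

# Searching with the same normalization form as the text finds it.
print(pattern_nfc in line_nfc)  # True
```

If this were the cause, normalizing the search term and the file content to the same form before comparing would be one way around it.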