I fetched Unicode text from a website and stored it in an NSString. The problem is that the text in the string is not decoded correctly, so I see only symbols. The text on the website is mainly Hebrew characters.

NSLog(@"%@", [trafficNodes[0] firstChild]);
NSLog(@"%@", [[trafficNodes[0] firstChild] content]);
NSLog(@"%@", [[[trafficNodes[0] firstChild] content] stringByReplacingPercentEscapesUsingEncoding:NSASCIIStringEncoding]);

This is what I see in the log:

2013-01-25 18:44:26.391 HTMLParsing[2450:c07] {
nodeContent = "\U05f3\U009e\U05f3\U00a2\U05f3\U2022\U05f3\U201c\U05f3\U203a\U05f3\U009f \U05f3\U009c\U05f3\U00a9\U05f3\U00a2\U05f3\U201d: 18:35\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U05f3\U201d\U05f3\U00d7\U05f3\U00a0\U05f3\U2022\U05f3\U00a2\U05f3\U201d \U05f3\U2013\U05f3\U2022\U05f3\U00a8\U05f3\U009e\U05f3\U00d7.\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0***\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U05f3\U009c\U05f3\U009e\U05f3\U00a1\U05f3\U2122\U05f3\U00a8\U05f3\U00d7 \U05f3\U2022\U05f3\U009c\U05f3\U00a7\U05f3\U2018\U05f3\U009c\U05f3\U00d7 \U05f3\U201c\U05f3\U2122\U05f3\U2022\U05f3\U2022\U05f3\U2014\U05f3\U2122\U05f3\U009d \U05f3\U2022\U05f3\U00d7\U05f3\U2013\U05f3\U009e\U05f3\U2022\U05f3\U00a0\U05f3\U2122\U05f3\U009d \U05f3\U2014\U05f3\U2122\U05f3\U2122\U05f3\U2019\U05f3\U2022: 918 - 800 - 1-800\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U05f3\U2018\U05f3\U00a0\U05f3\U00a1\U05f3\U2122\U05f3\U00a2\U05f3\U201d \U05f3\U2018\U05f3\U00a7\U05f3\U00a8\U05f3\U2018\U05f3\U00d7 \U05f3\U2018\U05f3\U00d7\U05f3\U2122 \U05f3\U00a1\U05f3\U20aa\U05f3\U00a8, \U05f3\U2019\U05f3\U2122\U05f3\U00a0\U05f3\U2022\U05f3\U00d7 \U05f3\U009e\U05f3\U00a9\U05f3\U2014\U05f3\U00a7\U05f3\U2122\U05f3\U009d \U05f3\U2022\U05f3\U009e\U05f3\U00d7\U05f3\U00a0\"\U05f3\U00a1\U05f3\U2122\U05f3\U009d \U05d2\U20ac\U201c \U05f3\U2122\U05f3\U00a9 \U05f3\U009c\U05f3\U201d\U05f3\U2022\U05f3\U00a8\U05f3\U2122\U05f3\U201c \U05f3\U009e\U05f3\U201d\U05f3\U2122\U05f3\U00a8\U05f3\U2022\U05f3\U00d7, \U05f3\U2019\U05f3\U009d \U05f3\U203a\U05f3\U00a9\U05f3\U201d\U05f3\U203a\U05f3\U2018\U05f3\U2122\U05f3\U00a9 \U05f3\U20aa\U05f3\U00a0\U05f3\U2022\U05f3\U2122. 
\U05f3\U2018\U05f3\U201d\U05f3\U2019\U05f3\U2122\U05f3\U00a2\U05f3\U203a\U05f3\U009d \U05f3\U009c\U05f3\U009e\U05f3\U00a2\U05f3\U2018\U05f3\U00a8 \U05f3\U2014\U05f3\U00a6\U05f3\U2122\U05f3\U2122\U05f3\U201d \U05d2\U20ac\U201c \U05f3\U0090\U05f3\U20aa\U05f3\U00a9\U05f3\U00a8\U05f3\U2022 \U05f3\U00d7\U05f3\U009e\U05f3\U2122\U05f3\U201c \U05f3\U2014\U05f3\U00a6\U05f3\U2122\U05f3\U2122\U05f3\U201d \U05f3\U009c\U05f3\U2122\U05f3\U009c\U05f3\U201c \U05f3\U201d\U05f3\U009e\U05f3\U2018\U05f3\U00a7\U05f3\U00a9 \U05f3\U009c\U05f3\U2014\U05f3\U00a6\U05f3\U2022\U05f3\U00d7. \U05f3\U201d\U05f3\U2122\U05f3\U2022 \U05f3\U201c\U05f3\U00a8\U05f3\U2022\U05f3\U203a\U05f3\U2122\U05f3\U009d, \U05f3\U00a2\U05f3\U00a8\U05f3\U00a0\U05f3\U2122\U05f3\U2122\U05f3\U009d \U05f3\U2022\U05f3\U009e\U05f3\U00a8\U05f3\U2022\U05f3\U203a\U05f3\U2013\U05f3\U2122\U05f3\U009d, \U05f3\U2022\U05f3\U2014\U05f3\U20aa\U05f3\U00a9\U05f3\U2022 \U05f3\U0090\U05f3\U00d7\U05f3\U009d \U05f3\U0090\U05f3\U00d7 \U05f3\U201d\U05f3\U2122\U05f3\U009c\U05f3\U201c\U05f3\U2122\U05f3\U009d \U05f3\U201d\U05f3\U00a2\U05f3\U00a9\U05f3\U2022\U05f3\U2122\U05f3\U2122\U05f3\U009d \U05f3\U009c\U05f3\U201d\U05f3\U00d7\U05f3\U20aa\U05f3\U00a8\U05f3\U00a5 \U05f3\U009c\U05f3\U203a\U05f3\U2018\U05f3\U2122\U05f3\U00a9.\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0 \U00a0\U05f3\U00a2\U05f3\U2022\U05f3\U00a8\U05f3\U009a \U05f3\U201c\U05f3\U2122\U05f3\U2022\U05f3\U2022\U05f3\U2014\U05f3\U2122 \U05f3\U201d\U05f3\U00d7\U05f3\U00a0\U05f3\U2022\U05f3\U00a2\U05f3\U201d: \U05f3\U009e\U05f3\U2022\U05f3\U00a8 \U05f3\U00a0\U05f3\U00a2\U05f3\U009e\U05f3\U009f.";
nodeName = text;
}
2013-01-25 18:44:26.392 HTMLParsing[2450:c07] ׳׳¢׳•׳“׳›׳ ׳׳©׳¢׳”: 18:35                                      ׳”׳×׳ ׳•׳¢׳” ׳–׳•׳¨׳׳×.                                      ***                   ׳׳׳¡׳™׳¨׳× ׳•׳׳§׳‘׳׳× ׳“׳™׳•׳•׳—׳™׳ ׳•׳×׳–׳׳•׳ ׳™׳ ׳—׳™׳™׳’׳•: 918 - 800 - 1-800                                      ׳‘׳ ׳¡׳™׳¢׳” ׳‘׳§׳¨׳‘׳× ׳‘׳×׳™ ׳¡׳₪׳¨, ׳’׳™׳ ׳•׳× ׳׳©׳—׳§׳™׳ ׳•׳׳×׳ "׳¡׳™׳ ג€“ ׳™׳© ׳׳”׳•׳¨׳™׳“ ׳׳”׳™׳¨׳•׳×, ׳’׳ ׳›׳©׳”׳›׳‘׳™׳© ׳₪׳ ׳•׳™. ׳‘׳”׳’׳™׳¢׳›׳ ׳׳׳¢׳‘׳¨ ׳—׳¦׳™׳™׳” ג€“ ׳׳₪׳©׳¨׳• ׳×׳׳™׳“ ׳—׳¦׳™׳™׳” ׳׳™׳׳“ ׳”׳׳‘׳§׳© ׳׳—׳¦׳•׳×. ׳”׳™׳• ׳“׳¨׳•׳›׳™׳, ׳¢׳¨׳ ׳™׳™׳ ׳•׳׳¨׳•׳›׳–׳™׳, ׳•׳—׳₪׳©׳• ׳׳×׳ ׳׳× ׳”׳™׳׳“׳™׳ ׳”׳¢׳©׳•׳™׳™׳ ׳׳”׳×׳₪׳¨׳¥ ׳׳›׳‘׳™׳©.                                      ׳¢׳•׳¨׳ ׳“׳™׳•׳•׳—׳™ ׳”׳×׳ ׳•׳¢׳”: ׳׳•׳¨ ׳ ׳¢׳׳.
2013-01-25 18:44:27.358 HTMLParsing[2450:c07] ׳׳¢׳•׳“׳›׳ ׳׳©׳¢׳”: 18:35                                      ׳”׳×׳ ׳•׳¢׳” ׳–׳•׳¨׳׳×.                                      ***                   ׳׳׳¡׳™׳¨׳× ׳•׳׳§׳‘׳׳× ׳“׳™׳•׳•׳—׳™׳ ׳•׳×׳–׳׳•׳ ׳™׳ ׳—׳™׳™׳’׳•: 918 - 800 - 1-800                                      ׳‘׳ ׳¡׳™׳¢׳” ׳‘׳§׳¨׳‘׳× ׳‘׳×׳™ ׳¡׳₪׳¨, ׳’׳™׳ ׳•׳× ׳׳©׳—׳§׳™׳ ׳•׳׳×׳ "׳¡׳™׳ ג€“ ׳™׳© ׳׳”׳•׳¨׳™׳“ ׳׳”׳™׳¨׳•׳×, ׳’׳ ׳›׳©׳”׳›׳‘׳™׳© ׳₪׳ ׳•׳™. ׳‘׳”׳’׳™׳¢׳›׳ ׳׳׳¢׳‘׳¨ ׳—׳¦׳™׳™׳” ג€“ ׳׳₪׳©׳¨׳• ׳×׳׳™׳“ ׳—׳¦׳™׳™׳” ׳׳™׳׳“ ׳”׳׳‘׳§׳© ׳׳—׳¦׳•׳×. ׳”׳™׳• ׳“׳¨׳•׳›׳™׳, ׳¢׳¨׳ ׳™׳™׳ ׳•׳׳¨׳•׳›׳–׳™׳, ׳•׳—׳₪׳©׳• ׳׳×׳ ׳׳× ׳”׳™׳׳“׳™׳ ׳”׳¢׳©׳•׳™׳™׳ ׳׳”׳×׳₪׳¨׳¥ ׳׳›׳‘׳™׳©.                                      ׳¢׳•׳¨׳ ׳“׳™׳•׳•׳—׳™ ׳”׳×׳ ׳•׳¢׳”: ׳׳•׳¨ ׳ ׳¢׳׳.

I tried using different encodings with no luck.

edit:

After using:

NSString *string = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
NSLog(@"%@", string);

Now the log shows the text as it should be. How can I convert it back to NSData without losing the encoding?

I need to pass it to the HTMLParser.

edit(2):

What worked for me was to convert the NSData to NSString and back using the right encoding:

NSString *encodedStringData = [[NSString alloc] initWithData:reportsHtmlData encoding:NSUTF8StringEncoding];
NSData *reportsHtmlDataEncoded = [encodedStringData dataUsingEncoding:CFStringConvertEncodingToNSStringEncoding (kCFStringEncodingWindowsHebrew)]; 

Thanks for your help.

oridahan
  • How do you populate the `NSString` objects? Do you use the encoding defined in the HTTP response? – trojanfoe Jan 25 '13 at 16:48
  • I don't know how to find which encoding is defined in the HTTP response, thanks. – oridahan Jan 25 '13 at 16:50
  • It's the `charset` parameter of the `Content-Type` header: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields. You **must** use this value when creating the `NSString` objects. – trojanfoe Jan 25 '13 at 16:56
  • I see that the encoding is windows-1255. content="text/html; charset=windows-1255" /> – oridahan Jan 25 '13 at 18:09
  • "how can I convert it to NSData without losing the encoding?" It sounds like it started out as UTF8 encoded data. – Mathew Jan 25 '13 at 18:39
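
The comments above suggest taking the encoding from the response. A minimal sketch of how that might look, assuming the charset name is available as a string (for example from the `textEncodingName` property of `NSURLResponse`, which reflects the HTTP header and not a `<meta>` tag inside the page). The helper name `EncodingForCharsetName` is hypothetical, and the UTF-8 fallback is an assumption, not something the thread prescribes:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: maps an IANA charset name such as "windows-1255"
// to an NSStringEncoding. Falls back to UTF-8 (an assumption) when the
// name is missing or not recognised.
static NSStringEncoding EncodingForCharsetName(NSString *charsetName) {
    if (charsetName.length == 0) {
        return NSUTF8StringEncoding;
    }
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charsetName);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return NSUTF8StringEncoding;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
```

The result can then be passed to `initWithData:encoding:`. As the answers below point out, the declared charset can be wrong, so it is worth checking the decoded text before trusting it.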

2 Answers

Maybe this can help you - NSSstring encoding not found

The answer there explains that NSString has no built-in constant for the Windows Hebrew encoding, but CFString does. I don't know exactly which encoding the webpage uses since you don't mention it, but hopefully you can give this a try.
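
A minimal sketch of that approach, assuming the bytes really are windows-1255 (the question's later edits suggest they may not be). The helper name `StringFromWindowsHebrewData` is hypothetical; the CFString constant and conversion function are the ones used in the question's second edit:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes bytes that are ASSUMED to be windows-1255.
// Returns nil if NSString cannot decode the data with that encoding.
static NSString *StringFromWindowsHebrewData(NSData *data) {
    NSStringEncoding hebrew =
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingWindowsHebrew);
    return [[NSString alloc] initWithData:data encoding:hebrew];
}
```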

jakobhans
  • Thanks. I tried changing the encoding to kCFStringEncodingWindowsHebrew via CFString, with no luck. I also don't know how to find the encoding the webpage uses. – oridahan Jan 25 '13 at 16:55

If initWithData:encoding: works when you declare the data as UTF-8, then the source text is probably encoded as UTF-8. If the Content-Type header (or the page's own charset declaration) tells you otherwise, it could be wrong. Unfortunately, sometimes the headers are wrong.

To answer the question, "How can I convert it to NSData without losing the encoding?"

You can't.

NSData is raw bytes. An encoding is just a strategy for interpreting them. NSData holds the bytes with no record of how they were encoded; NSString holds the decoded text, so you can work with characters (each of which may take one or more bytes, depending on the encoding) instead of with bytes directly.
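
To illustrate the point, here is a small sketch that interprets the same two bytes under two encodings. The names `Interpret` and `DemoInterpretations` are hypothetical. The bytes are the UTF-8 form of one Hebrew letter; read as windows-1255 they should come out as the kind of two-symbol garbage seen in the question's log:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: interprets raw bytes under the given encoding.
static NSString *Interpret(NSData *bytes, NSStringEncoding encoding) {
    return [[NSString alloc] initWithData:bytes encoding:encoding];
}

// Hypothetical demo: one Hebrew letter, two readings of its UTF-8 bytes.
static void DemoInterpretations(void) {
    const unsigned char raw[] = {0xD7, 0xA9};  // UTF-8 bytes of U+05E9 (shin)
    NSData *bytes = [NSData dataWithBytes:raw length:sizeof raw];
    NSStringEncoding hebrew =
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingWindowsHebrew);

    NSLog(@"as UTF-8:        %@", Interpret(bytes, NSUTF8StringEncoding));
    NSLog(@"as windows-1255: %@", Interpret(bytes, hebrew));
}
```

The NSData object is identical in both calls; only the encoding passed alongside it changes what the resulting string contains.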

I'm not sure which HTMLParser you are referring to. If it takes raw bytes (NSData), then you will need to tell it to use UTF-8 encoding. If it takes a string (NSString), then you can just pass it the newly created string.

benzado