
I'm trying to replace some escaped Unicode sequences in an NSString. I haven't had any luck with the CFString functions, so I thought I would try regular expressions.

Here is the regex:

NSRegularExpression *regexUnicode2 = [NSRegularExpression regularExpressionWithPattern:@"(\\u([0-9A-Fa-f]){4}){2}" options:0 error:&error];

Then I try to get matches using this:

NSArray *twoEscapeArray = [regexUnicode2 matchesInString:selfCopy options:0 range:NSMakeRange(0, self.length)];

selfCopy is a mutable copy of the input string. Here is a piece of that string:

muestran al p\u00c3\u00bablico las encuadernaciones de las colecciones reales adem\u00c3\u00a1s de otros objetos hist\u00c3\u00b3ricos en relaci\u00c3\u00b3n con \u00c3\u00a9stas.

La muestra, considerada a nivel mundial como uno de los conjuntos ligatorios hist\u00c3\u00b3ricos m\u00c3\u00a1s importantes, se completa con obras de arte como armas, alfombras y relojes. Estos son objetos que ayudan a entender la encuadernaci\u00c3\u00b3n como elemento fundamental de la cultura de corte.

Los fondos de la Real Biblioteca, del Real Monasterio de San Lorenzo de El Escorial, del Monasterio de Santa Mar\u00c3\u00ada la Real de las Huelgas de Burgos, del Monasterio de las

Without proper conversion, each pair of escaped Unicode sequences is rendered as two separate characters instead of one when I put the text into a UIWebView.

This is how the raw JSON data is encoded, and I haven't had any luck getting it to convert to Latin characters properly.

Anyway, the problem is that the array twoEscapeArray is nil after the match attempt. I'm not sure why.

Jim

1 Answer


You mean \u00c3\u00ba is getting converted to Ãº? That looks like the correct behavior to me. The real question is how those Unicode escapes got in there. It looks like the text was decoded incorrectly at some point (possibly when the NSString was created?), and what should have been the two-byte UTF-8 encoding of the letter ú (U+00FA, Latin Small Letter U With Acute) was instead decoded as two separate characters, Ã (U+00C3) and º (U+00BA).
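To illustrate the point about mis-decoding, here is a minimal sketch (not taken from the asker's code) that decodes the same two bytes, 0xC3 0xBA, once as UTF-8 and once as ISO Latin-1. The variable names are made up for the example; the Foundation calls are the standard NSData and NSString initializers.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // The two bytes that UTF-8 uses to encode ú (U+00FA).
        const unsigned char bytes[] = {0xC3, 0xBA};
        NSData *data = [NSData dataWithBytes:bytes length:sizeof bytes];

        // Decoded as UTF-8, the byte pair becomes a single character.
        NSString *asUTF8 = [[NSString alloc] initWithData:data
                                                 encoding:NSUTF8StringEncoding];

        // Decoded as Latin-1, each byte becomes its own character:
        // U+00C3 followed by U+00BA.
        NSString *asLatin1 = [[NSString alloc] initWithData:data
                                                   encoding:NSISOLatin1StringEncoding];

        NSLog(@"UTF-8:   %@ (length %lu)", asUTF8, (unsigned long)asUTF8.length);
        NSLog(@"Latin-1: %@ (length %lu)", asLatin1, (unsigned long)asLatin1.length);
    }
    return 0;
}
```

The UTF-8 string has length 1 and the Latin-1 string has length 2, which matches the "each pair produces two characters" symptom described in the question.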

Try going back to where you created the NSString, this time specifying UTF-8 as the encoding.
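If the source cannot be fixed (for example, because the JSON itself already contains the \u00c3\u00ba escapes), one possible workaround is to reverse the bad decoding after the fact: re-encode the string as Latin-1 to recover the original bytes, then decode those bytes as UTF-8. The sketch below assumes every character in the garbled string fits in Latin-1; RepairDoubleDecodedString is a hypothetical helper name, not something from the original post or from Foundation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: undoes a UTF-8-read-as-Latin-1 mix-up.
// Returns nil if the input cannot be represented in Latin-1
// or if the recovered bytes are not valid UTF-8.
static NSString *RepairDoubleDecodedString(NSString *garbled) {
    // Step 1: turn each character back into the single byte it came from.
    NSData *bytes = [garbled dataUsingEncoding:NSISOLatin1StringEncoding];
    if (bytes == nil) {
        return nil;
    }
    // Step 2: interpret those bytes the way they were meant to be read.
    return [[NSString alloc] initWithData:bytes encoding:NSUTF8StringEncoding];
}

int main(void) {
    @autoreleasepool {
        // "p" + U+00C3 + U+00BA + "blico", i.e. what the JSON escapes decode to.
        NSString *garbled = [NSString stringWithFormat:@"p%C%Cblico",
                             (unichar)0x00C3, (unichar)0x00BA];
        NSString *repaired = RepairDoubleDecodedString(garbled);
        NSLog(@"%@ -> %@", garbled, repaired);
    }
    return 0;
}
```

This only treats the symptom. Decoding the raw data as UTF-8 at the point where the NSString is first created remains the cleaner fix when that is possible.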

Alan Moore
  • I added the code that converts the original data to an NSString. My understanding is that \u00c3\u00b3 is valid for the single character oacute ó. (As shown here: http://cpansearch.perl.org/src/JANPAZ/Cstools-3.42/Cz/Cstocs/enc/utf8.enc) – Jim May 02 '12 at 15:45