
I have an input string that is supposed to be UTF-8 but is not, and I need to fix it. The code is in Ruby 2, so Iconv is no longer available, and encode and force_encoding are not working as intended:

[5] pry(main)> a='zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[6] pry(main)> a.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[8] pry(main)> a.encode!(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[10] pry(main)> a.force_encoding(Encoding::UTF_8)
=> "zg\\u0142oszeniem"

How can I fix it?

pkoltermann
  • Single-quoted strings don't process escape sequences (such as `\uXXXX`). Did you mean `"zg\u0142oszeniem"`? This outputs `"zgłoszeniem"` – Sergio Tulentsev Mar 17 '18 at 11:45
  • So your string contains six literal chars, instead of one unicode codepoint. `["\\", "u", "0", "1", "4", "2"]` – Sergio Tulentsev Mar 17 '18 at 11:47
  • To give more context: I have a test extending "ActionController::TestCase" where the broken string is returned by "response.body". When the assertion compares it to a sample UTF-8 string it fails, and the output shows the text from the question. – pkoltermann Mar 17 '18 at 11:50
  • Well, we need a [mcve] here. Short of you using the wrong quotes (like above), I can't, off the top of my head, tell how a string could "unfold" like this. – Sergio Tulentsev Mar 17 '18 at 11:56
  • "and need to fix it" - you forgot to define what "fix" means. – Sergio Tulentsev Mar 17 '18 at 11:56
  • So maybe from the other side: my component is being fed this string and I need to parse it into proper UTF-8 characters. Is there any way to do it? – pkoltermann Mar 17 '18 at 12:07
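
To illustrate the point made in the comments, here is a small sketch (the variable names are illustrative, not from the thread) of why `encode` and `force_encoding` appear to do nothing. The single-quoted literal is plain ASCII, so it is already valid UTF-8 and there is nothing to replace:

```ruby
literal = 'zg\u0142oszeniem'   # single quotes: backslash, "u", and four digits stay literal
decoded = "zg\u0142oszeniem"   # double quotes: \u0142 becomes one code point

puts literal.length            # 16
puts decoded.length            # 11
puts literal.ascii_only?       # true
puts literal.valid_encoding?   # true

# Nothing is invalid or undefined, so encode has nothing to replace.
same = literal.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
puts same == literal           # true
```

In other words, the problem is not the encoding of the bytes but the escape sequences stored in them, which have to be decoded explicitly.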

1 Answer


Here's a solution using a regex [1]:

a.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }

It should work for that particular string:

[1] pry(main)> before = 'zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[2] pry(main)> before.split('')
=> ["z", "g", "\\", "u", "0", "1", "4", "2", "o", "s", "z", "e", "n", "i", "e", "m"]
[3] pry(main)> after = before.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
=> "zgłoszeniem"
[4] pry(main)> after.split('')
=> ["z", "g", "ł", "o", "s", "z", "e", "n", "i", "e", "m"]

[1] Unicode code points range from 0 to 10FFFF (hexadecimal; definition D9 in Section 3.4, Characters and Encoding, of the Unicode Standard), which should explain why the regex above looks the way it does.
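
One caveat worth sketching: because the regex accepts a variable number of hex digits, it can swallow ordinary letters that follow the escape when they happen to be hex digits (a to f). The example below assumes the escapes are JSON/JavaScript style, that is, always exactly four hex digits; that is an assumption about the input, not something the question confirms. `decode_u_escapes` is a hypothetical helper name, and surrogate pairs (code points above U+FFFF) are not handled here.

```ruby
# Hypothetical helper, not from the answer: decodes \uXXXX escapes assuming
# each escape is exactly four hex digits (JSON/JavaScript style).
# Surrogate pairs (code points above U+FFFF) are not handled in this sketch.
def decode_u_escapes(text)
  text.gsub(/\\u([0-9a-fA-F]{4})/) { Regexp.last_match(1).hex.chr(Encoding::UTF_8) }
end

greedy = /\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/  # regex from the answer
input  = 'g\u0142adki'                              # "a" and "d" are also hex digits

variable = input.gsub(greedy) { Regexp.last_match(1).hex.chr(Encoding::UTF_8) }
fixed    = decode_u_escapes(input)

puts variable == "g\u0142adki"  # false: "0142a" was consumed as one code point
puts fixed == "g\u0142adki"     # true
```

For the string in the question both versions give the same result, since the letter after the escape ("o") is not a hex digit. If the input is actually JSON, parsing it with a JSON library is likely a more robust option than either regex.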

Maciej Nędza